[jira] [Commented] (PHOENIX-2521) Support duplicate rows in CSV Bulk Loader

James Taylor (JIRA) Sat, 23 Jan 2016 12:02:56 -0800

    [ https://issues.apache.org/jira/browse/PHOENIX-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113929#comment-15113929 ]


James Taylor commented on PHOENIX-2521:
---------------------------------------

That's correct - the CSV Bulk Loader does not handle the case where the 
input contains duplicate rows.

> Support duplicate rows in CSV Bulk Loader
> -----------------------------------------
>
>                 Key: PHOENIX-2521
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2521
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.5.2
>            Reporter: Afshin Moazami
>
> I found out that the MapReduce CSV bulk load tool doesn't behave the same as 
> UPSERTs. Is this by design or a bug?
> Here are the queries for creating the table and index:
> {code} CREATE TABLE mySchema.mainTable (
> id varchar NOT NULL,
> name varchar,
> address varchar
> CONSTRAINT pk PRIMARY KEY (id)); {code}
> {code} CREATE INDEX myIndex 
> ON mySchema.mainTable  (name, id) 
> INCLUDE (address); {code}
> If I execute two upserts where the second one updates the name (which is the 
> key for the index), everything works fine (the record is updated in both 
> the main table and the index table):
> {code} UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 
> 'john', 'Montreal');{code}
> {code}UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'jack', 
> 'Montreal');{code}
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from 
> mySchema.mainTable where name = 'jack'; {code}  ==> one record
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from 
> mySchema.mainTable where name = 'john';  {code}  ==> zero records
> But if I load the data using org.apache.phoenix.mapreduce.CsvBulkLoadTool into 
> the main table, it behaves differently. The main table is updated, but the 
> new record is appended to the index table instead of replacing the old entry:
> HADOOP_CLASSPATH=/usr/lib/hbase/lib/hbase-protocol-1.1.2.jar:/etc/hbase/conf 
> hadoop jar  
> /usr/lib/hbase/phoenix-4.5.2-HBase-1.1-bin/phoenix-4.5.2-HBase-1.1-client.jar 
> org.apache.phoenix.mapreduce.CsvBulkLoadTool -d',' -s mySchema -t mainTable 
> -i /tmp/input.txt 
> input.txt:
> 2,tomas,montreal
> 2,george,montreal
> (I tried it both with and without -it and got the same result.)
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from 
> mySchema.mainTable where name = 'tomas' {code} ==> one record;
> {code} SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from 
> mySchema.mainTable where name = 'george' {code} ==> one record;
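
Until the loader handles this case, one possible workaround is to collapse duplicate keys in the input file before running CsvBulkLoadTool. The following is only a sketch in Python, not part of Phoenix: it assumes the primary key is the first CSV column, that the last row for a key should win (mirroring what repeated UPSERTs would do), and that the file fits in memory. The function names are hypothetical.

```python
import csv
import io


def dedupe_last_wins(rows, key_index=0):
    """Keep only the last row seen for each primary-key value.

    Assumption: the key is a single column at key_index, and later rows
    should override earlier ones, as repeated UPSERTs would.
    """
    latest = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        key = row[key_index]
        # Drop any earlier entry so the surviving row keeps the
        # position of its last occurrence.
        latest.pop(key, None)
        latest[key] = row
    return list(latest.values())


def dedupe_csv_text(text, delimiter=","):
    """Return CSV text with duplicate keys collapsed (last row wins)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(dedupe_last_wins(reader))
    return out.getvalue()


if __name__ == "__main__":
    # The two rows from input.txt in the report above.
    sample = "2,tomas,montreal\n2,george,montreal\n"
    print(dedupe_csv_text(sample), end="")  # prints: 2,george,montreal
```

With the input.txt from the report, only the row for 'george' would be handed to the loader. This does not address rows that were already present in the table before the load.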



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
