[
https://issues.apache.org/jira/browse/PHOENIX-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113929#comment-15113929
]
James Taylor commented on PHOENIX-2521:
---------------------------------------
That's correct, [[email protected]] - the CSV Bulk Loader does not
handle the case when there are duplicate rows.
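
Until that is supported, one possible workaround is to collapse duplicate
primary keys in the input file before running the bulk loader. The sketch below
is only an illustration and is not part of Phoenix: the helper names
`dedupe_by_key` and `to_csv` are hypothetical, it assumes the primary key is the
first CSV column (like `id` in the table from the issue), and it assumes the
last occurrence of a key should win, mirroring what two successive UPSERTs
would produce.

```python
import csv
import io


def dedupe_by_key(lines, key_index=0):
    """Return CSV rows with one row per key, keeping the last occurrence.

    Hypothetical helper, not part of Phoenix. Assumes the primary key is a
    single column at position key_index.
    """
    latest = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # Drop any earlier row for this key so the surviving row takes the
        # position of its last occurrence.
        latest.pop(row[key_index], None)
        latest[row[key_index]] = row
    return list(latest.values())


def to_csv(rows):
    """Serialize rows back to CSV text with newline line endings."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Same two rows as input.txt in the issue description.
    sample = ["2,tomas,montreal", "2,george,montreal"]
    print(to_csv(dedupe_by_key(sample)), end="")
```

With the two rows from the issue's input.txt, only the `george` row remains, so
the bulk loader would see a single row for key `2`. Whether "last row wins" is
the right policy depends on the data; it is chosen here only to match UPSERT
ordering.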
> Support duplicate rows in CSV Bulk Loader
> -----------------------------------------
>
> Key: PHOENIX-2521
> URL: https://issues.apache.org/jira/browse/PHOENIX-2521
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.5.2
> Reporter: Afshin Moazami
>
> I found that the MapReduce CSV bulk load tool doesn't behave the same as
> UPSERTs. Is this by design, or is it a bug?
> Here are the queries for creating the table and index:
> {code} CREATE TABLE mySchema.mainTable (
> id varchar NOT NULL,
> name varchar,
> address varchar
> CONSTRAINT pk PRIMARY KEY (id)); {code}
> {code} CREATE INDEX myIndex
> ON mySchema.mainTable (name, id)
> INCLUDE (address); {code}
> If I execute two upserts where the second one updates the name (which is the
> key for the index), everything works fine (the record is updated in both
> the table and the index table):
> {code} UPSERT INTO mySchema.mainTable (id, name, address) values ('1',
> 'john', 'Montreal');{code}
> {code}UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'jack',
> 'Montreal');{code}
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from
> mySchema.mainTable where name = 'jack'; {code} ==> one record
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from
> mySchema.mainTable where name = 'john'; {code} ==> zero records
> But if I load the data into the main table using
> org.apache.phoenix.mapreduce.CsvBulkLoadTool, it behaves differently. The main
> table is updated, but the new record is appended to the index table:
> HADOOP_CLASSPATH=/usr/lib/hbase/lib/hbase-protocol-1.1.2.jar:/etc/hbase/conf
> hadoop jar
> /usr/lib/hbase/phoenix-4.5.2-HBase-1.1-bin/phoenix-4.5.2-HBase-1.1-client.jar
> org.apache.phoenix.mapreduce.CsvBulkLoadTool -d',' -s mySchema -t mainTable
> -i /tmp/input.txt
> input.txt:
> 2,tomas,montreal
> 2,george,montreal
> (I have tried it both with/without -it and got the same result)
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from
> mySchema.mainTable where name = 'tomas' {code} ==> one record;
> {code} SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from
> mySchema.mainTable where name = 'george' {code} ==> one record;