Slight rectification in typo of the query syntax " select a.remark,a.id, getTupleId() as *tupleId* from c_indextest1 a, c_indextest2 b where a.id=b.id;"
On Fri, Oct 20, 2017 at 7:54 AM, sounak <[email protected]> wrote: > Internally, we call a UDF to generate the TupleId and based on that > tupleId we decide if the row is duplicated or not. You can run a slightly > tweaked query > > " select a.remark,a.id, getTupleId() as TupleId from c_indextest1 a, > c_indextest2 > b where a.id=b.id; " > > In order to prompt Multiple Input Rows most probably the getTupleId() will > give duplicate entries. > > In between your given query runs fine with the data i have generated in > house. So this is data dependent. So it will be good if we get > *origin_data* or the script which has build the data for origin_data. > > On Fri, Oct 20, 2017 at 7:28 AM, yixu2001 <[email protected]> wrote: > >> dev >> one row record >> >> >> yixu2001 >> >> From: Liang Chen >> Date: 2017-10-19 22:26 >> To: dev >> Subject: Re: Re: Update statement failed with "Multiple input rows >> matched for same row" in version 1.2.0, >> Hi >> >> Execute the below query, return one row record or multiple row records ? >> --------------------------------- >> select a.remark from c_indextest1 a where a.id=b.id >> >> Regards >> Liang >> >> >> yixu2001 wrote >> > dev >> > You can follow the steps below to reproduce the problem. >> > tables c_indextest2 has 1700w records and table c_indextest1 has about >> 30w >> > records. >> > >> > step 1: >> > cc.sql("CREATE TABLE IF NOT EXISTS c_indextest2 (id string ,remark >> string) >> > STORED BY 'carbondata'").show; >> > >> > step 2: origin_data is a existing table with 1700w records, and fields >> in >> > the table do not matter. >> > cc.sql("insert into c_indextest2 select row_number() over(partition by >> > a.PKid order by a.pkid) id,a.remark from (SELECT '1' PKID,'dfsdd' >> remark >> > from origin_data limit 17000000) a").show; >> > >> > step 3: >> > cc.sql("CREATE TABLE IF NOT EXISTS c_indextest1 (id string ,remark >> string) >> > STORED BY 'carbondata'").show; >> > >> > step 4: >> > cc.sql("insert into c_indextest1 select * from c_indextest2 where >> > pmod(cast(ID as int),50)=43").show; >> > >> > setp 5: >> > cc.sql("update c_indextest2 b set (b.remark)=(select a.remark from >> > c_indextest1 a where a.id=b.id)").show; >> > >> > >> > yixu2001 >> > >> > From: sounak >> > Date: 2017-10-19 18:26 >> > To: dev >> > Subject: Re: Re: Update statement failed with "Multiple input rows >> matched >> > for same row" in version 1.2.0, >> > Hi Yixu, >> > >> > Thanks for the ddl. But i am particularly interested on the data in >> order >> > to reproduce the problem. Will it be feasible to share to data or if it >> is >> > huge then the scripts that generates it? I cannot access the data >> location >> > from public network. >> > >> > >> > On Thu, Oct 19, 2017 at 3:00 PM, yixu2001 < >> >> > yixu2001@ >> >> > > wrote: >> > >> >> dev >> >> >> >> Step 1:I make a hive table qqdata2.h_indextest1: >> >> CREATE EXTERNAL TABLE `qqdata2.h_indextest1`( >> >> `id` INT, >> >> `CUST_ORDER_ID` STRING, >> >> `ORDER_ITEM_IDATTR_ID` STRING, >> >> `ATTR_VALUE_IDATTR_VALUE` STRING, >> >> `CREATE_DATE` STRING, >> >> `UPDATE_DATE` STRING, >> >> `STATUS_CD` STRING, >> >> `STATUS_DATE` STRING, >> >> `AREA_ID` STRING, >> >> `REGION_CD` STRING, >> >> `UPDATE_STAFF` STRING, >> >> `CREATE_STAFF` STRING, >> >> `SHARDING_ID` STRING, >> >> `ORDER_ATTR_ID` STRING ) >> >> ROW FORMAT DELIMITED >> >> FIELDS TERMINATED BY '\u0001' >> >> LINES TERMINATED BY '\n' >> >> NULL DEFINED AS '' >> >> STORED AS INPUTFORMAT >> >> 'org.apache.hadoop.mapred.TextInputFormat' >> >> OUTPUTFORMAT >> >> 'org.apache.hadoop.hive.ql.io >> >> >> >> .HiveIgnoreKeyTextOutputFormat' >> >> LOCATION >> >> 'hdfs://hdp78.ffcs.cn:8020/user/bigdata/streamcql/dist1 >> >> >> >> '; >> >> >> >> there are csv files in hdfs dir user/bigdata/streamcql/dist1, the data >> >> format in the csv file just as following >> >> >> >> 1939505130,171483932,287305502,813463930,20160709134396669, >> >> 201607101469099594,1299,20160711996575390,10,73,302063, >> 302064,127859875, >> >> 9999999 >> >> >> >> Step 2:I make a carbon table qqdata2.c_indextest1: >> >> cc.sql("CREATE TABLE IF NOT EXISTS qqdata2.c_indextest1 (id STRING, >> >> CUST_ORDER_ID STRING,ORDER_ITEM_IDATTR_ID >> STRING,ATTR_VALUE_IDATTR_VALUE >> >> STRING,CREATE_DATE STRING,UPDATE_DATE STRING,STATUS_CD >> STRING,STATUS_DATE >> >> STRING,AREA_ID STRING,REGION_CD STRING,UPDATE_STAFF STRING,CREATE_STAFF >> >> STRING,SHARDING_ID STRING,ORDER_ATTR_ID STRING) STORED BY 'carbondata' >> ") >> >> >> >> Step 3:Insert data: >> >> cc.sql("insert into qqdata2.c_indextest1 select * from >> >> qqdata2.h_indextest1").show(100,false); >> >> >> >> Step 4: Repeat from step1 to step 3, I make another carbon table >> >> qqdata2.c_indextest2 >> >> The record number of qqdata2.c_indextest1 is 30w, the record number of >> >> qqdata2.c_indextest2 is 1700w. >> >> >> >> >> >> >> >> yixu2001 >> >> >> >> From: sounak >> >> Date: 2017-10-19 14:13 >> >> To: dev >> >> Subject: Re: Update statement failed with "Multiple input rows matched >> >> for >> >> same row" in version 1.2.0, >> >> Hi Yixu, >> >> >> >> Can you please share the DDLs and the Data for the above problem with >> us? >> >> >> >> Thanks >> >> Sounak >> >> >> >> On Wed, Oct 18, 2017 at 12:44 PM, yixu2001 < >> >> > yixu2001@ >> >> > > wrote: >> >> >> >> > dev >> >> > >> >> > >> >> > In carbondata version 1.2.0, I execute "update" statement with >> >> sub-query, >> >> > it failed. >> >> > All the rows in the 2 tables are not duplicated, and the same >> statement >> >> > will succeed in carbondata version 1.1.1. >> >> > >> >> > The test log as following: >> >> > scala> cc.sql("select count(*), count(distinct id) from >> >> > qqdata2.c_indextest1").show(100,false); >> >> > +--------+------------------+ >> >> > |count(1)|count(DISTINCT id)| >> >> > +--------+------------------+ >> >> > |300000 |300000 | >> >> > +--------+------------------+ >> >> > >> >> > scala> cc.sql("select count(*), count(distinct id) from >> >> > qqdata2.c_indextest2").show(100,false); >> >> > +--------+------------------+ >> >> > |count(1)|count(DISTINCT id)| >> >> > +--------+------------------+ >> >> > |71223220|71223220 | >> >> > +--------+------------------+ >> >> > >> >> > scala> cc.sql("update qqdata2.c_indextest2 a >> >> set(a.CUST_ORDER_ID,a.ORDER_ >> >> > ITEM_IDATTR_ID,a.ATTR_VALUE_IDATTR_VALUE,a.CREATE_DATE,a. >> >> > UPDATE_DATE,a.STATUS_CD,a.STATUS_DATE,a.AREA_ID,a. >> >> > REGION_CD,a.UPDATE_STAFF,a.CREATE_STAFF,a.SHARDING_ID,a.ORDE >> R_ATTR_ID) >> >> = >> >> > (select b.CUST_ORDER_ID,b.ORDER_ITEM_IDATTR_ID,b.ATTR_VALUE_IDATTR_ >> >> > VALUE,b.CREATE_DATE,b.UPDATE_DATE,b.STATUS_CD,b.STATUS_ >> >> > DATE,b.AREA_ID,b.REGION_CD,b.UPDATE_STAFF,b.CREATE_STAFF,b. >> >> SHARDING_ID,b.ORDER_ATTR_ID >> >> > from qqdata2.c_indextest1 b where a.id = b.id)").show(100,false); >> >> > 17/10/18 11:32:46 WARN Utils: Truncated the string representation of >> a >> >> > plan since it was too large. This behavior can be adjusted by setting >> >> > 'spark.debug.maxToStringFields' in SparkEnv.conf. >> >> > 17/10/18 11:33:20 AUDIT deleteExecution$: [hdp84.ffcs.cn >> >> > >> >> > ][bigdata][Thread-1]Delete data operation is failed for >> >> > qqdata2.c_indextest2 >> >> > 17/10/18 11:33:20 ERROR deleteExecution$: main Delete data operation >> is >> >> > failed due to failure in creating delete delta file for segment : >> null >> >> > block : null >> >> > 17/10/18 11:33:20 ERROR ProjectForUpdateCommand$: main Exception in >> >> update >> >> > operationjava.lang.Exception: Multiple input rows matched for same >> row. >> >> > java.lang.RuntimeException: Update operation failed. Multiple input >> >> rows >> >> > matched for same row. >> >> > at scala.sys.package$.error(package.scala:27) >> >> > at org.apache.spark.sql.execution.command.ProjectForUpdateCommand. >> >> > processData(IUDCommands.scala:239) >> >> > at >> >> org.apache.spark.sql.execution.command.ProjectForUpdateCommand.run( >> >> > IUDCommands.scala:141) >> >> > at org.apache.spark.sql.execution.command.ExecutedCommandExec. >> >> > sideEffectResult$lzycompute(commands.scala:58) >> >> > at org.apache.spark.sql.execution.command.ExecutedCommandExec. >> >> > sideEffectResult(commands.scala:56) >> >> > at org.apache.spark.sql.execution.command.ExecutedCommandExec. >> >> > executeTake(commands.scala:71) >> >> > at org.apache.spark.sql.execution.CollectLimitExec. >> >> > executeCollect(limit.scala:38) >> >> > at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$ >> >> > Dataset$$execute$1$1.apply(Dataset.scala:2378) >> >> > at org.apache.spark.sql.execution.SQLExecution$.withNewExecutio >> nId( >> >> > SQLExecution.scala:57) >> >> > at >> >> org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780) >> >> > at org.apache.spark.sql.Dataset.org >> >> > >> >> > $apache$spark$sql$Dataset$$execute$1(Dataset.scala:2377) >> >> > at org.apache.spark.sql.Dataset.org >> >> > >> >> > $apache$spark$sql$Dataset$$collect(Dataset.scala:2384) >> >> > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset. >> >> > scala:2120) >> >> > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset. >> >> > scala:2119) >> >> > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala >> :2810) >> >> > at org.apache.spark.sql.Dataset.head(Dataset.scala:2119) >> >> > at org.apache.spark.sql.Dataset.take(Dataset.scala:2334) >> >> > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) >> >> > at org.apache.spark.sql.Dataset.show(Dataset.scala:640) >> >> > ... 50 elided >> >> > >> >> > >> >> > >> >> > yixu2001 >> >> > >> >> >> >> >> >> >> >> -- >> >> Thanks >> >> Sounak >> >> >> > >> > >> > >> > -- >> > Thanks >> > Sounak >> >> >> >> >> >> -- >> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5 >> .nabble.com/ >> > > > > -- > Thanks > Sounak > -- Thanks Sounak
