[jira] [Created] (HUDI-7773) Allow Users to extend S3/GCS HoodieIncrSource to bring in additional columns from upstream
Balaji Varadarajan created HUDI-7773: Summary: Allow Users to extend S3/GCS HoodieIncrSource to bring in additional columns from upstream Key: HUDI-7773 URL: https://issues.apache.org/jira/browse/HUDI-7773 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: Balaji Varadarajan Assignee: Balaji Varadarajan The current S3/GCS HoodieIncrSource reads file paths from upstream tables and ingests them into downstream tables. We need the ability to extend this functionality by joining in additional columns from the upstream table before writing to the downstream table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
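The join described in the issue above can be sketched outside Spark as a simple hash join: enrich each ingested row with extra columns looked up by file path. This is a minimal illustrative sketch, not a Hudi API; the class name, method name, and the "s3.object.key" column are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enrich rows produced by an IncrSource with additional
// upstream columns, joined on an assumed "s3.object.key" field. Not a Hudi API.
class UpstreamJoinSketch {
    public static List<Map<String, Object>> joinUpstreamColumns(
            List<Map<String, Object>> ingested,
            Map<String, Map<String, Object>> upstreamByKey) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : ingested) {
            Map<String, Object> enriched = new HashMap<>(row);
            Map<String, Object> extra = upstreamByKey.get(row.get("s3.object.key"));
            if (extra != null) {
                enriched.putAll(extra); // bring in the additional upstream columns
            }
            out.add(enriched);
        }
        return out;
    }
}
```

In the real source this would be a join between the ingested DataFrame and selected columns of the upstream table before the downstream write.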
[jira] [Created] (HUDI-7674) Hudi CLI : Command "metadata validate-files" not using file listing to validate
Balaji Varadarajan created HUDI-7674: Summary: Hudi CLI : Command "metadata validate-files" not using file listing to validate Key: HUDI-7674 URL: https://issues.apache.org/jira/browse/HUDI-7674 Project: Apache Hudi Issue Type: Bug Reporter: Balaji Varadarajan Assignee: Balaji Varadarajan metadata validate-files is expected to compare the file system view provided by the metadata layer against a raw file listing, but this is broken. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7008) Fixing usage of Kafka Avro deserializer w/ debezium sources
[ https://issues.apache.org/jira/browse/HUDI-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-7008: Assignee: sivabalan narayanan > Fixing usage of Kafka Avro deserializer w/ debezium sources > --- > > Key: HUDI-7008 > URL: https://issues.apache.org/jira/browse/HUDI-7008 > Project: Apache Hudi > Issue Type: Bug >Reporter: Balaji Varadarajan >Assignee: sivabalan narayanan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7008) Fixing usage of Kafka Avro deserializer w/ debezium sources
Balaji Varadarajan created HUDI-7008: Summary: Fixing usage of Kafka Avro deserializer w/ debezium sources Key: HUDI-7008 URL: https://issues.apache.org/jira/browse/HUDI-7008 Project: Apache Hudi Issue Type: Bug Reporter: Balaji Varadarajan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5933) Fix NullPointer Exception in MultiTableDeltaStreamer when Transformer_class config is not set
Balaji Varadarajan created HUDI-5933: Summary: Fix NullPointer Exception in MultiTableDeltaStreamer when Transformer_class config is not set Key: HUDI-5933 URL: https://issues.apache.org/jira/browse/HUDI-5933 Project: Apache Hudi Issue Type: Bug Reporter: Balaji Varadarajan Context : https://github.com/apache/hudi/pull/6726#issuecomment-1468270289 -- This message was sent by Atlassian Jira (v8.20.10#820010)
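The NullPointerException above comes from dereferencing an unset config. A minimal sketch of the kind of guard implied, assuming a hypothetical "transformer.class" property key: treat the missing value as absent rather than calling methods on null.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical sketch of guarding an optional transformer-class property
// instead of dereferencing a null value (the NPE described in the issue).
class TransformerConfigSketch {
    static final String TRANSFORMER_CLASS_PROP = "transformer.class"; // illustrative key

    public static Optional<String> resolveTransformerClass(Properties props) {
        // getProperty returns null when the key is unset; wrap it in an
        // Optional instead of invoking methods on the raw result.
        return Optional.ofNullable(props.getProperty(TRANSFORMER_CLASS_PROP))
                .map(String::trim)
                .filter(s -> !s.isEmpty());
    }
}
```

Callers can then skip the transformation step entirely when the Optional is empty.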
[jira] [Commented] (HUDI-2761) IllegalArgException from timeline server when serving getLastestBaseFiles with multi-writer
[ https://issues.apache.org/jira/browse/HUDI-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448275#comment-17448275 ] Balaji Varadarajan commented on HUDI-2761: -- [~shivnarayan]: I'm not sure I understand why you think (a) is infeasible. In this case, by the time of the first failure the driver would already have updated to the latest commit, so it should not error out unless another commit arrives. (a) would keep/reduce the FS calls to only the driver, while (b) could increase FS calls. I think we should look at handling this in RemoteHoodieTableFileSystemView and retry (once) before it gives up and the executor loads the filesystem view locally. Regarding the exception stack trace, I agree we can make it an INFO message without dumping the stack trace. > IllegalArgException from timeline server when serving getLastestBaseFiles > with multi-writer > --- > > Key: HUDI-2761 > URL: https://issues.apache.org/jira/browse/HUDI-2761 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.10.0 > > Attachments: Screen Shot 2021-11-15 at 8.27.11 AM.png, Screen Shot > 2021-11-15 at 8.27.33 AM.png, Screen Shot 2021-11-15 at 8.28.03 AM.png, > Screen Shot 2021-11-15 at 8.28.25 AM.png > > > When concurrent writes try to ingest to hudi, occasionally, we run into > IllegalArgumentException as below. Even though the exception is seen, the actual > write succeeds. > Here is what is happening from my understanding. > > Lets say the table's latest commit is C3. > Writer1 tries to commit C4, writer2 tries to do C5 and writer3 tries to do C6 > (all 3 are non-overlapping and so expected to succeed) > I started C4 from writer1 and then switched to writer 2 and triggered C5 and > then did the same for writer3. > C4 went through fine for writer1 and succeeded.
> for writer2, when the timeline got instantiated, its latest snapshot was C3, but > when it received the getLatestBaseFiles() request, the latest commit was C4 and > so it throws an exception. A similar issue happened w/ writer3 as well. > > {code:java} > scala> df.write.format("hudi"). > | options(getQuickstartWriteConfigs). > | option(PRECOMBINE_FIELD.key(), "created_at"). > | option(RECORDKEY_FIELD.key(), "other"). > | option(PARTITIONPATH_FIELD.key(), "type"). > | option("hoodie.cleaner.policy.failed.writes","LAZY"). > | > option("hoodie.write.concurrency.mode","OPTIMISTIC_CONCURRENCY_CONTROL"). > | > option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"). > | option("hoodie.write.lock.zookeeper.url","localhost"). > | option("hoodie.write.lock.zookeeper.port","2181"). > | option("hoodie.write.lock.zookeeper.lock_key","locks"). > | > option("hoodie.write.lock.zookeeper.base_path","/tmp/mw_testing/.locks"). > | option(TBL_NAME.key(), tableName). > | mode(Append).
> | save(basePath) > 21/11/15 07:47:33 WARN HoodieSparkSqlWriter$: Commit time 2025074733457 > 21/11/15 07:47:35 WARN EmbeddedTimelineService: Started embedded timeline > server at 10.0.0.202:57644 > [Stage 2:> (0 > 21/11/15 07:47:39 > ERROR RequestHandler: Got runtime exception servicing request > partition=CreateEvent=2025074301094=file%3A%2Ftmp%2Fmw_testing%2Ftrial2=2025074301094=ce963fe977a9d2176fadecf16c223cb3b98d7f6f7aaaf41cd7855eb098aee47d > java.lang.IllegalArgumentException: Last known instant from client was > 2025074301094 but server has the following timeline > [[2025074301094__commit__COMPLETED], > [2025074731908__commit__COMPLETED]] > at > org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40) > at > org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:510) > at io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22) > at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606) > at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46) > at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17) > at io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143) > at io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41) > at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107) > at > io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72) > at >
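The "retry (once) before it gives up" idea from the comment above can be sketched generically. This is an illustrative sketch only, assuming the hypothetical names retryOnce/refreshView; it is not the actual RemoteHoodieTableFileSystemView code.

```java
import java.util.function.Supplier;

// Hypothetical sketch of retrying a remote file-system-view call once after
// refreshing local state, before falling back to a local view.
class RetryOnceSketch {
    public static <T> T retryOnce(Supplier<T> call, Runnable refreshView) {
        try {
            return call.get();
        } catch (IllegalArgumentException firstFailure) {
            // e.g. "Last known instant from client was ... but server has ...":
            // resync the client's timeline view once, then try exactly once more.
            refreshView.run();
            return call.get();
        }
    }
}
```

If the second attempt also fails, the caller would fall back to loading the filesystem view locally, as suggested in the comment.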
[jira] [Created] (HUDI-2166) Support Alter table drop column
Balaji Varadarajan created HUDI-2166: Summary: Support Alter table drop column Key: HUDI-2166 URL: https://issues.apache.org/jira/browse/HUDI-2166 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: Balaji Varadarajan Assignee: pengzhiwei Just like adding and renaming columns, we need DDL support for dropping a column. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1741) Row Level TTL Support for records stored in Hudi
Balaji Varadarajan created HUDI-1741: Summary: Row Level TTL Support for records stored in Hudi Key: HUDI-1741 URL: https://issues.apache.org/jira/browse/HUDI-1741 Project: Apache Hudi Issue Type: New Feature Components: Utilities Reporter: Balaji Varadarajan For example: retain only records updated in the last month. GH: https://github.com/apache/hudi/issues/2743 -- This message was sent by Atlassian Jira (v8.3.4#803005)
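As a sketch of what row-level TTL means here, assuming each record carries a last-update timestamp: a record expires once it is older than the configured TTL, and expired records are filtered out. The "update_time" field name and the retention logic are illustrative assumptions, not a proposed Hudi design.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of row-level TTL: a record expires once its last
// update time is older than the configured TTL (e.g. 30 days).
class TtlSketch {
    public static boolean isExpired(long updateTimeMs, long nowMs, long ttlMs) {
        return nowMs - updateTimeMs > ttlMs;
    }

    // Keep only records updated within the TTL window, matching the
    // "records only updated last month" example in the issue.
    public static List<Map<String, Long>> retainLive(List<Map<String, Long>> records,
                                                     long nowMs, long ttlMs) {
        return records.stream()
                .filter(r -> !isExpired(r.get("update_time"), nowMs, ttlMs))
                .collect(Collectors.toList());
    }
}
```

In Hudi this filtering would most naturally happen during compaction or cleaning rather than per-read.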
[jira] [Commented] (HUDI-1741) Row Level TTL Support for records stored in Hudi
[ https://issues.apache.org/jira/browse/HUDI-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311938#comment-17311938 ] Balaji Varadarajan commented on HUDI-1741: -- [~shivnarayan]: FYI > Row Level TTL Support for records stored in Hudi > > > Key: HUDI-1741 > URL: https://issues.apache.org/jira/browse/HUDI-1741 > Project: Apache Hudi > Issue Type: New Feature > Components: Utilities >Reporter: Balaji Varadarajan >Priority: Major > > For example: retain only records updated in the last month > > GH: https://github.com/apache/hudi/issues/2743 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1724) run_sync_tool support for hive3.1.2 on hadoop3.1.4
[ https://issues.apache.org/jira/browse/HUDI-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17309272#comment-17309272 ] Balaji Varadarajan commented on HUDI-1724: -- [~shivnarayan] : Can you please triage this > run_sync_tool support for hive3.1.2 on hadoop3.1.4 > -- > > Key: HUDI-1724 > URL: https://issues.apache.org/jira/browse/HUDI-1724 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: Balaji Varadarajan >Priority: Major > > Context: https://github.com/apache/hudi/issues/2717 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1724) run_sync_tool support for hive3.1.2 on hadoop3.1.4
Balaji Varadarajan created HUDI-1724: Summary: run_sync_tool support for hive3.1.2 on hadoop3.1.4 Key: HUDI-1724 URL: https://issues.apache.org/jira/browse/HUDI-1724 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Reporter: Balaji Varadarajan Context: https://github.com/apache/hudi/issues/2717 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1711) Avro Schema Exception with Spark 3.0 in 0.7
[ https://issues.apache.org/jira/browse/HUDI-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307095#comment-17307095 ] Balaji Varadarajan commented on HUDI-1711: -- [~shivnarayan]: Can you triage this issue when you get a chance? > Avro Schema Exception with Spark 3.0 in 0.7 > --- > > Key: HUDI-1711 > URL: https://issues.apache.org/jira/browse/HUDI-1711 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Balaji Varadarajan >Priority: Major > > GH: [https://github.com/apache/hudi/issues/2705] > > {{21/03/22 10:10:35 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. > 21/03/22 10:10:35 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.RuntimeException: Error while decoding: java.lang.NegativeArraySizeException: -1255727808 > createexternalrow(...)}} > (The remainder of the quoted log is a Catalyst createexternalrow deserializer expression over the row's struct columns; its struct type parameters were lost in extraction, so the truncated expression is elided here.)
[jira] [Created] (HUDI-1711) Avro Schema Exception with Spark 3.0 in 0.7
Balaji Varadarajan created HUDI-1711: Summary: Avro Schema Exception with Spark 3.0 in 0.7 Key: HUDI-1711 URL: https://issues.apache.org/jira/browse/HUDI-1711 Project: Apache Hudi Issue Type: Bug Components: DeltaStreamer Reporter: Balaji Varadarajan GH: [https://github.com/apache/hudi/issues/2705] {{21/03/22 10:10:35 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 21/03/22 10:10:35 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1) java.lang.RuntimeException: Error while decoding: java.lang.NegativeArraySizeException: -1255727808 createexternalrow(...)}} (The remainder of the log is a Catalyst createexternalrow deserializer expression over the row's struct columns; its struct type parameters were lost in extraction, so the truncated expression is elided here.)
[jira] [Commented] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file
[ https://issues.apache.org/jira/browse/HUDI-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290922#comment-17290922 ] Balaji Varadarajan commented on HUDI-1640: -- [~shivnarayan]: Can you vet this and add it to the work queue? > Implement Spark Datasource option to read hudi configs from properties file > --- > > Key: HUDI-1640 > URL: https://issues.apache.org/jira/browse/HUDI-1640 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Balaji Varadarajan >Priority: Major > > Provide a config option like "hoodie.datasource.props.file" to load all the > options from a file. > > GH: https://github.com/apache/hudi/issues/2605 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file
Balaji Varadarajan created HUDI-1640: Summary: Implement Spark Datasource option to read hudi configs from properties file Key: HUDI-1640 URL: https://issues.apache.org/jira/browse/HUDI-1640 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: Balaji Varadarajan Provide a config option like "hoodie.datasource.props.file" to load all the options from a file. GH: https://github.com/apache/hudi/issues/2605 -- This message was sent by Atlassian Jira (v8.3.4#803005)
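A minimal sketch of what a "hoodie.datasource.props.file" option could do, using only java.util.Properties: read key/value pairs and merge them beneath options passed explicitly in code. The precedence rule (explicit options override the file) is an assumption for illustration, not settled behavior.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: load datasource options from a properties source and
// merge with options set in code. Precedence (explicit wins) is assumed.
class PropsFileSketch {
    public static Map<String, String> mergedOptions(Reader propsSource,
                                                    Map<String, String> explicit) {
        Properties fileProps = new Properties();
        try {
            fileProps.load(propsSource); // standard java.util.Properties format
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> merged = new HashMap<>();
        for (String name : fileProps.stringPropertyNames()) {
            merged.put(name, fileProps.getProperty(name));
        }
        merged.putAll(explicit); // explicitly passed options override the file
        return merged;
    }
}
```

The resulting map is what would be handed to the Spark datasource `options(...)` call.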
[jira] [Commented] (HUDI-1608) MOR fetches all records for read optimized query w/ spark sql
[ https://issues.apache.org/jira/browse/HUDI-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282851#comment-17282851 ] Balaji Varadarajan commented on HUDI-1608: -- [~shivnarayan]: You need to set spark.sql.hive.convertMetastoreParquet=false (https://hudi.apache.org/docs/querying_data.html#spark-sql) > MOR fetches all records for read optimized query w/ spark sql > - > > Key: HUDI-1608 > URL: https://issues.apache.org/jira/browse/HUDI-1608 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Priority: Major > Labels: sev:critical, user-support-issues > > Script to reproduce in local spark: > > [https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364] > > ``` > scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, > _hoodie_partition_path, id, __op from hudi_trips_snapshot order by > _hoodie_record_key").show(false) > +-------------------+------------------+----------------------+---+----+ > |_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id|__op| > +-------------------+------------------+----------------------+---+----+ > |20210210070347 |1 |1970-01-01 |1 |null| > |20210210070347 |2 |1970-01-01 |2 |null| > |20210210070347 |3 |2020-01-04 |3 |D | > |20210210070347 |4 |1998-04-13 |4 |I | > |20210210070347 |5 |2020-01-01 |5 |I | > |*20210210070445* |*6* |*1998-04-13* |*6* |*I* | > +-------------------+------------------+----------------------+---+----+ > ``` > After an upsert, read optimized query returns records from both C1 and C2. > Also, I don't find any log files in partitions. All of them are parquet > files. > > ls /tmp/hudi_trips_cow/1998-04-13/ > 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet > 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet > ls /tmp/hudi_trips_cow/1970-01-01/ > 7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet > 7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet > > Source of the issue: [https://github.com/apache/hudi/issues/2255] > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1523) Avoid excessive mkdir calls when creating new files
Balaji Varadarajan created HUDI-1523: Summary: Avoid excessive mkdir calls when creating new files Key: HUDI-1523 URL: https://issues.apache.org/jira/browse/HUDI-1523 Project: Apache Hudi Issue Type: Improvement Components: Writer Core Reporter: Balaji Varadarajan Fix For: 0.8.0 https://github.com/apache/hudi/issues/2423 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1505) Allow pluggable option to write error records to side table, queue
Balaji Varadarajan created HUDI-1505: Summary: Allow pluggable option to write error records to side table, queue Key: HUDI-1505 URL: https://issues.apache.org/jira/browse/HUDI-1505 Project: Apache Hudi Issue Type: New Feature Components: DeltaStreamer Reporter: Balaji Varadarajan Fix For: 0.8.0 Context : https://github.com/apache/hudi/issues/2401 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1501) Explore providing ways to auto-tune input record size based on incoming payload
Balaji Varadarajan created HUDI-1501: Summary: Explore providing ways to auto-tune input record size based on incoming payload Key: HUDI-1501 URL: https://issues.apache.org/jira/browse/HUDI-1501 Project: Apache Hudi Issue Type: New Feature Components: Writer Core Reporter: Balaji Varadarajan Fix For: 0.8.0 Context: https://github.com/apache/hudi/issues/2393#issuecomment-752452753 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1501) Explore providing ways to auto-tune input record size based on incoming payload
[ https://issues.apache.org/jira/browse/HUDI-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1501: - Status: Open (was: New) > Explore providing ways to auto-tune input record size based on incoming > payload > --- > > Key: HUDI-1501 > URL: https://issues.apache.org/jira/browse/HUDI-1501 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Minor > Fix For: 0.8.0 > > > Context: https://github.com/apache/hudi/issues/2393#issuecomment-752452753 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1499) Support configuration to let user override record-size estimate
[ https://issues.apache.org/jira/browse/HUDI-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1499: Assignee: sivabalan narayanan > Support configuration to let user override record-size estimate > - > > Key: HUDI-1499 > URL: https://issues.apache.org/jira/browse/HUDI-1499 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: sivabalan narayanan >Priority: Major > Labels: newbie > Fix For: 0.8.0 > > > Context: [https://github.com/apache/hudi/issues/2393] > > It would be helpful if for some reason the user needs to ingest a batch of > records which has a very different record sizes compared to existing records. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1499) Support configuration to let user override record-size estimate
Balaji Varadarajan created HUDI-1499: Summary: Support configuration to let user override record-size estimate Key: HUDI-1499 URL: https://issues.apache.org/jira/browse/HUDI-1499 Project: Apache Hudi Issue Type: Improvement Components: Writer Core Reporter: Balaji Varadarajan Fix For: 0.8.0 Context: [https://github.com/apache/hudi/issues/2393] It would be helpful when, for some reason, the user needs to ingest a batch of records with very different record sizes compared to existing records. -- This message was sent by Atlassian Jira (v8.3.4#803005)
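The override described here can be sketched as a simple precedence rule: use the user-supplied estimate when set, otherwise fall back to an average derived from a previous commit. The property key "hoodie.record.size.estimate", the default of 1024 bytes, and the method shape are illustrative assumptions, not the real Hudi config.

```java
import java.util.Properties;

// Hypothetical sketch of letting a user override the record-size estimate.
// "hoodie.record.size.estimate" is an illustrative key, not the real config.
class RecordSizeSketch {
    public static long recordSizeEstimate(Properties props,
                                          long lastCommitBytes,
                                          long lastCommitRecords) {
        String override = props.getProperty("hoodie.record.size.estimate");
        if (override != null) {
            // The user knows the incoming batch better than history does.
            return Long.parseLong(override);
        }
        if (lastCommitRecords > 0) {
            return lastCommitBytes / lastCommitRecords; // average from history
        }
        return 1024; // illustrative default when no commit history exists
    }
}
```

The estimate feeds file sizing (how many records to pack into each base file), which is why a wildly different batch can mis-size files without an override.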
[jira] [Updated] (HUDI-1499) Support configuration to let user override record-size estimate
[ https://issues.apache.org/jira/browse/HUDI-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1499: - Status: Open (was: New) > Support configuration to let user override record-size estimate > - > > Key: HUDI-1499 > URL: https://issues.apache.org/jira/browse/HUDI-1499 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > Labels: newbie > Fix For: 0.8.0 > > > Context: [https://github.com/apache/hudi/issues/2393] > > It would be helpful if for some reason the user needs to ingest a batch of > records which has a very different record sizes compared to existing records. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1497) Timeout Exception during getFileStatus()
[ https://issues.apache.org/jira/browse/HUDI-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1497: - Status: Open (was: New) > Timeout Exception during getFileStatus() > - > > Key: HUDI-1497 > URL: https://issues.apache.org/jira/browse/HUDI-1497 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > > Seeing this happening when running RFC-15 branch in long running mode. There > could be a resource leak as I am seeing this consistently after every 1 or 2 > hour period runs. The below log shows it is during accessing bootstrap index > but I am seeing it in getFileStatus() for other files too. > > > Caused by: java.io.InterruptedIOException: getFileStatus on > s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile: > com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout > waiting for connection from poolCaused by: java.io.InterruptedIOException: > getFileStatus on > s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile: > com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout > waiting for connection from pool at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:141) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:117) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1859) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1823) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1627) at > org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2500) at > 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.exists(HoodieWrapperFileSystem.java:549) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.(HFileBootstrapIndex.java:102) > ... 33 moreCaused by: com.amazonaws.SdkClientException: Unable to execute > HTTP request: Timeout waiting for connection from pool at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1113) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1063) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) > at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4229) at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4176) at > com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1253) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1053) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1841) > ... 39 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1497) Timeout Exception during getFileStatus()
Balaji Varadarajan created HUDI-1497: Summary: Timeout Exception during getFileStatus() Key: HUDI-1497 URL: https://issues.apache.org/jira/browse/HUDI-1497 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan Seeing this happening when running RFC-15 branch in long running mode. There could be a resource leak as I am seeing this consistently after every 1 or 2 hour period runs. The below log shows it is during accessing bootstrap index but I am seeing it in getFileStatus() for other files too. Caused by: java.io.InterruptedIOException: getFileStatus on s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from poolCaused by: java.io.InterruptedIOException: getFileStatus on s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:141) at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:117) at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1859) at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1823) at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1627) at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2500) at org.apache.hudi.common.fs.HoodieWrapperFileSystem.exists(HoodieWrapperFileSystem.java:549) at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.(HFileBootstrapIndex.java:102) ... 
33 moreCaused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1113) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1063) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4229) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4176) at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1253) at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1053) at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1841) ... 39 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1496) Seek Error when querying MOR tables in GCP
[ https://issues.apache.org/jira/browse/HUDI-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1496: Assignee: sivabalan narayanan > Seek Error when querying MOR tables in GCP > -- > > Key: HUDI-1496 > URL: https://issues.apache.org/jira/browse/HUDI-1496 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: sivabalan narayanan >Priority: Major > > Context : [https://github.com/apache/hudi/issues/2367] > FSUtils.isGCSInputStream is not catching all the cases when reading from GCS. > In some cases > (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java#L76), > the condition in isGCSInputStream breaks. > > Instead of isGCSInputStream, we should detect GCSFileSystem by checking > the filesystem scheme against StorageSchemes. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1496) Seek Error when querying MOR tables in GCP
[ https://issues.apache.org/jira/browse/HUDI-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1496: - Status: Open (was: New) > Seek Error when querying MOR tables in GCP > -- > > Key: HUDI-1496 > URL: https://issues.apache.org/jira/browse/HUDI-1496 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Balaji Varadarajan >Priority: Major > > Context : [https://github.com/apache/hudi/issues/2367] > FSUtils.isGCSInputStream is not catching all the cases when reading from GCS. > In some cases > (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java#L76), > the condition in isGCSInputStream breaks. > > Instead of isGCSInputStream, we should detect GCSFileSystem by checking > the filesystem scheme against StorageSchemes. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1496) Seek Error when querying MOR tables in GCP
Balaji Varadarajan created HUDI-1496: Summary: Seek Error when querying MOR tables in GCP Key: HUDI-1496 URL: https://issues.apache.org/jira/browse/HUDI-1496 Project: Apache Hudi Issue Type: Bug Components: Common Core Reporter: Balaji Varadarajan Context : [https://github.com/apache/hudi/issues/2367] FSUtils.isGCSInputStream is not catching all the cases when reading from GCS. In some cases (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java#L76), the condition in isGCSInputStream breaks. Instead of isGCSInputStream, we should detect GCSFileSystem by checking the filesystem scheme against StorageSchemes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
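The scheme-based detection proposed in HUDI-1496 can be sketched as follows. This is a minimal, self-contained illustration only: Hudi's actual fix would consult its StorageSchemes enum and the Hadoop FileSystem URI, so the class and helper names below are hypothetical.

```java
import java.net.URI;

// Illustrative sketch: identify GCS by the filesystem URI scheme ("gs")
// instead of inspecting the input-stream class, which is what the issue
// reports as fragile. In Hudi proper, the scheme would be matched against
// the StorageSchemes enum rather than an inlined literal.
public class GcsSchemeCheck {
    public static boolean isGcsPath(URI uri) {
        // GCS paths use the "gs" scheme, e.g. gs://bucket/path/file.log
        return "gs".equalsIgnoreCase(uri.getScheme());
    }

    public static void main(String[] args) {
        System.out.println(isGcsPath(URI.create("gs://bucket/table/file.log")));
        System.out.println(isGcsPath(URI.create("s3://bucket/table/file.log")));
    }
}
```

A scheme check is stable across wrapper streams and Hadoop versions, which is why it is preferred over instanceof checks on the stream implementation.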
[jira] [Updated] (HUDI-1490) Incremental Query fails if there are partitions that have no incremental changes
[ https://issues.apache.org/jira/browse/HUDI-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1490: - Status: Open (was: New) > Incremental Query fails if there are partitions that have no incremental > changes > > > Key: HUDI-1490 > URL: https://issues.apache.org/jira/browse/HUDI-1490 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > Context: https://github.com/apache/hudi/issues/2362 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1490) Incremental Query fails if there are partitions that have no incremental changes
Balaji Varadarajan created HUDI-1490: Summary: Incremental Query fails if there are partitions that have no incremental changes Key: HUDI-1490 URL: https://issues.apache.org/jira/browse/HUDI-1490 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Reporter: Balaji Varadarajan Fix For: 0.7.0 Context: https://github.com/apache/hudi/issues/2362 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1475) Fix documentation of preCombine to clarify when this API is used by Hudi
[ https://issues.apache.org/jira/browse/HUDI-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252023#comment-17252023 ] Balaji Varadarajan commented on HUDI-1475: -- Relevant Issue: https://github.com/apache/hudi/issues/2345 > Fix documentation of preCombine to clarify when this API is used by Hudi > - > > Key: HUDI-1475 > URL: https://issues.apache.org/jira/browse/HUDI-1475 > Project: Apache Hudi > Issue Type: Task > Components: Docs >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > We need to fix the Javadoc of preCombine in HoodieRecordPayload to clarify > that this method is used to pre-merge unmerged (compaction) and incoming > records before the merge with the existing record in the dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1475) Fix documentation of preCombine to clarify when this API is used by Hudi
Balaji Varadarajan created HUDI-1475: Summary: Fix documentation of preCombine to clarify when this API is used by Hudi Key: HUDI-1475 URL: https://issues.apache.org/jira/browse/HUDI-1475 Project: Apache Hudi Issue Type: Task Components: Docs Reporter: Balaji Varadarajan Fix For: 0.7.0 We need to fix the Javadoc of preCombine in HoodieRecordPayload to clarify that this method is used to pre-merge unmerged (compaction) and incoming records before the merge with the existing record in the dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1475) Fix documentation of preCombine to clarify when this API is used by Hudi
[ https://issues.apache.org/jira/browse/HUDI-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1475: - Status: Open (was: New) > Fix documentation of preCombine to clarify when this API is used by Hudi > - > > Key: HUDI-1475 > URL: https://issues.apache.org/jira/browse/HUDI-1475 > Project: Apache Hudi > Issue Type: Task > Components: Docs >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > We need to fix the Javadoc of preCombine in HoodieRecordPayload to clarify > that this method is used to pre-merge unmerged (compaction) and incoming > records before the merge with the existing record in the dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off
[ https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1452: - Description: [https://github.com/apache/hudi/issues/2321] We need to make RocksDBFileSystemView lazy initializable so that it would work seamlessly when run in an executor. was: [https://github.com/apache/hudi/issues/2321] We need to make > RocksDB FileSystemView throwing NotSerializableError when embedded timeline > server is turned off > > > Key: HUDI-1452 > URL: https://issues.apache.org/jira/browse/HUDI-1452 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Sreeram Ramji >Priority: Major > > [https://github.com/apache/hudi/issues/2321] > > We need to make RocksDBFileSystemView lazy initializable so that it would > work seamlessly when run in an executor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off
[ https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1452: - Description: [https://github.com/apache/hudi/issues/2321] We need to make was:https://github.com/apache/hudi/issues/2321 > RocksDB FileSystemView throwing NotSerializableError when embedded timeline > server is turned off > > > Key: HUDI-1452 > URL: https://issues.apache.org/jira/browse/HUDI-1452 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Sreeram Ramji >Priority: Major > > [https://github.com/apache/hudi/issues/2321] > > We need to make -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off
[ https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1452: - Status: Open (was: New) > RocksDB FileSystemView throwing NotSerializableError when embedded timeline > server is turned off > > > Key: HUDI-1452 > URL: https://issues.apache.org/jira/browse/HUDI-1452 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Sreeram Ramji >Priority: Major > > https://github.com/apache/hudi/issues/2321 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off
Balaji Varadarajan created HUDI-1452: Summary: RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off Key: HUDI-1452 URL: https://issues.apache.org/jira/browse/HUDI-1452 Project: Apache Hudi Issue Type: Bug Components: Common Core Reporter: Balaji Varadarajan https://github.com/apache/hudi/issues/2321 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off
[ https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1452: Assignee: Sreeram Ramji > RocksDB FileSystemView throwing NotSerializableError when embedded timeline > server is turned off > > > Key: HUDI-1452 > URL: https://issues.apache.org/jira/browse/HUDI-1452 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Sreeram Ramji >Priority: Major > > https://github.com/apache/hudi/issues/2321 -- This message was sent by Atlassian Jira (v8.3.4#803005)
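The lazy-initialization idea behind HUDI-1452 can be sketched as follows. This is an illustrative, self-contained example, not Hudi's actual RocksDBFileSystemView code: the non-serializable handle is marked transient and created on first access, so the enclosing view can be serialized and shipped to an executor without carrying the open store with it.

```java
import java.io.Serializable;

// Hypothetical sketch of the lazy-initialization fix: the heavy,
// non-serializable resource (standing in for a RocksDB handle) is
// transient and opened on first use in whatever JVM the view lands in.
public class LazyViewSketch {
    public static class HeavyStore { /* stands in for a RocksDB handle */ }

    public static class FileSystemView implements Serializable {
        private transient HeavyStore store; // skipped by serialization; rebuilt per JVM

        private HeavyStore store() {
            if (store == null) {
                store = new HeavyStore(); // opened lazily, e.g. on the executor
            }
            return store;
        }

        public boolean isInitialized() {
            return store != null;
        }

        public void query() {
            store(); // first access triggers initialization
        }
    }

    public static void main(String[] args) {
        FileSystemView view = new FileSystemView();
        System.out.println(view.isInitialized()); // false: nothing opened yet
        view.query();
        System.out.println(view.isInitialized()); // true: opened on first use
    }
}
```

Because the transient field is never serialized, Java's NotSerializableException cannot be triggered by the store handle; each deserialized copy simply re-opens its own store when first queried.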
[jira] [Updated] (HUDI-1440) Allow option to override schema when doing spark.write
[ https://issues.apache.org/jira/browse/HUDI-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1440: - Status: Open (was: New) > Allow option to override schema when doing spark.write > -- > > Key: HUDI-1440 > URL: https://issues.apache.org/jira/browse/HUDI-1440 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.8.0 > > > We need the ability to pass a schema and use it to create the RDD when > building the input batch from the data frame. > > df.write.format("hudi").option("hudi.avro.schema", "").. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1440) Allow option to override schema when doing spark.write
Balaji Varadarajan created HUDI-1440: Summary: Allow option to override schema when doing spark.write Key: HUDI-1440 URL: https://issues.apache.org/jira/browse/HUDI-1440 Project: Apache Hudi Issue Type: New Feature Components: Spark Integration Reporter: Balaji Varadarajan Fix For: 0.8.0 We need the ability to pass a schema and use it to create the RDD when building the input batch from the data frame. df.write.format("hudi").option("hudi.avro.schema", "").. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1436) Provide Option to run auto clean every nth commit.
Balaji Varadarajan created HUDI-1436: Summary: Provide Option to run auto clean every nth commit. Key: HUDI-1436 URL: https://issues.apache.org/jira/browse/HUDI-1436 Project: Apache Hudi Issue Type: New Feature Components: Cleaner Reporter: Balaji Varadarajan Fix For: 0.7.0 We need a mechanism (analogous to compaction scheduling via hoodie.compact.inline.max.delta.commits) that lets users schedule cleaning every nth commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1436) Provide Option to run auto clean every nth commit.
[ https://issues.apache.org/jira/browse/HUDI-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1436: - Status: Open (was: New) > Provide Option to run auto clean every nth commit. > --- > > Key: HUDI-1436 > URL: https://issues.apache.org/jira/browse/HUDI-1436 > Project: Apache Hudi > Issue Type: New Feature > Components: Cleaner >Reporter: Balaji Varadarajan >Assignee: Sreeram Ramji >Priority: Major > Fix For: 0.7.0 > > > We need a mechanism (analogous to compaction scheduling via > hoodie.compact.inline.max.delta.commits) that lets users schedule cleaning > every nth commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1436) Provide Option to run auto clean every nth commit.
[ https://issues.apache.org/jira/browse/HUDI-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1436: Assignee: Sreeram Ramji > Provide Option to run auto clean every nth commit. > --- > > Key: HUDI-1436 > URL: https://issues.apache.org/jira/browse/HUDI-1436 > Project: Apache Hudi > Issue Type: New Feature > Components: Cleaner >Reporter: Balaji Varadarajan >Assignee: Sreeram Ramji >Priority: Major > Fix For: 0.7.0 > > > We need a mechanism (analogous to compaction scheduling via > hoodie.compact.inline.max.delta.commits) that lets users schedule cleaning > every nth commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
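The behavior requested in HUDI-1436 amounts to a simple counter check, analogous to how inline compaction is triggered after a configured number of delta commits. A minimal sketch (the class, method, and parameter names are hypothetical illustrations, not Hudi configs):

```java
// Hypothetical sketch of "clean every nth commit" scheduling. In Hudi,
// the threshold would come from a cleaner config modeled after
// hoodie.compact.inline.max.delta.commits; here it is just a parameter.
public class CleanEveryN {
    // True once enough commits have accumulated since the last clean.
    public static boolean shouldScheduleClean(int commitsSinceLastClean, int maxCommitsBetweenCleans) {
        return commitsSinceLastClean >= maxCommitsBetweenCleans;
    }

    public static void main(String[] args) {
        System.out.println(shouldScheduleClean(2, 5)); // too soon
        System.out.println(shouldScheduleClean(5, 5)); // nth commit reached
    }
}
```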
[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present
[ https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1435: - Status: Patch Available (was: In Progress) > Marker File Reconciliation failing for Non-Partitioned datasets when > duplicate marker files present > --- > > Key: HUDI-1435 > URL: https://issues.apache.org/jira/browse/HUDI-1435 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present
[ https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1435: - Status: Open (was: New) > Marker File Reconciliation failing for Non-Partitioned datasets when > duplicate marker files present > --- > > Key: HUDI-1435 > URL: https://issues.apache.org/jira/browse/HUDI-1435 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present
[ https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1435: - Status: In Progress (was: Open) > Marker File Reconciliation failing for Non-Partitioned datasets when > duplicate marker files present > --- > > Key: HUDI-1435 > URL: https://issues.apache.org/jira/browse/HUDI-1435 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present
[ https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1435: - Status: In Progress (was: Open) > Marker File Reconciliation failing for Non-Partitioned datasets when > duplicate marker files present > --- > > Key: HUDI-1435 > URL: https://issues.apache.org/jira/browse/HUDI-1435 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present
[ https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1435: - Summary: Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present (was: Marker File Reconciliation failing for Non-Partitioned Paths when duplicate marker files present) > Marker File Reconciliation failing for Non-Partitioned datasets when > duplicate marker files present > --- > > Key: HUDI-1435 > URL: https://issues.apache.org/jira/browse/HUDI-1435 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned Paths when duplicate marker files present
[ https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1435: Assignee: Balaji Varadarajan > Marker File Reconciliation failing for Non-Partitioned Paths when duplicate > marker files present > > > Key: HUDI-1435 > URL: https://issues.apache.org/jira/browse/HUDI-1435 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned Paths when duplicate marker files present
Balaji Varadarajan created HUDI-1435: Summary: Marker File Reconciliation failing for Non-Partitioned Paths when duplicate marker files present Key: HUDI-1435 URL: https://issues.apache.org/jira/browse/HUDI-1435 Project: Apache Hudi Issue Type: New Feature Components: Common Core Reporter: Balaji Varadarajan Fix For: 0.7.0 GH : https://github.com/apache/hudi/issues/2294 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1329) Support async compaction in spark DF write()
[ https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243675#comment-17243675 ] Balaji Varadarajan commented on HUDI-1329: -- [~309637554]: This API allows only running compaction. Note that there is no input dataframe to be ingested. You can create a dummy dataframe if needed, but the operation does not have to care about the input DF. It only needs to run compaction (a specific compaction id if provided by the user, or the oldest one if not provided). > Support async compaction in spark DF write() > > > Key: HUDI-1329 > URL: https://issues.apache.org/jira/browse/HUDI-1329 > Project: Apache Hudi > Issue Type: Improvement > Components: Compaction >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > spark.write().format("hudi").option(operation, "run_compact") to run > compaction > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1413) Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync
[ https://issues.apache.org/jira/browse/HUDI-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1413: - Fix Version/s: 0.7.0 > Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync > -- > > Key: HUDI-1413 > URL: https://issues.apache.org/jira/browse/HUDI-1413 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > GH issue : https://github.com/apache/hudi/issues/2270 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1413) Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync
Balaji Varadarajan created HUDI-1413: Summary: Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync Key: HUDI-1413 URL: https://issues.apache.org/jira/browse/HUDI-1413 Project: Apache Hudi Issue Type: New Feature Components: Usability Reporter: Balaji Varadarajan GH issue : https://github.com/apache/hudi/issues/2270 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1413) Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync
[ https://issues.apache.org/jira/browse/HUDI-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1413: - Status: Open (was: New) > Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync > -- > > Key: HUDI-1413 > URL: https://issues.apache.org/jira/browse/HUDI-1413 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Balaji Varadarajan >Priority: Major > > GH issue : https://github.com/apache/hudi/issues/2270 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1395) HoodieSnapshotCopier not working on non-partitioned datasets
Balaji Varadarajan created HUDI-1395: Summary: HoodieSnapshotCopier not working on non-partitioned datasets Key: HUDI-1395 URL: https://issues.apache.org/jira/browse/HUDI-1395 Project: Apache Hudi Issue Type: Bug Components: Utilities Reporter: Balaji Varadarajan Fix For: 0.6.1 https://github.com/apache/hudi/issues/2244 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1395) HoodieSnapshotCopier not working on non-partitioned datasets
[ https://issues.apache.org/jira/browse/HUDI-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1395: - Status: Open (was: New) > HoodieSnapshotCopier not working on non-partitioned datasets > > > Key: HUDI-1395 > URL: https://issues.apache.org/jira/browse/HUDI-1395 > Project: Apache Hudi > Issue Type: Bug > Components: Utilities >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.6.1 > > > https://github.com/apache/hudi/issues/2244 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1205) Serialization fail when log file is larger than 2GB
[ https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230386#comment-17230386 ] Balaji Varadarajan commented on HUDI-1205: -- [~leehuynh] [~zuyanton] [~garyli1019] Please see the above comment and try master. > Serialization fail when log file is larger than 2GB > --- > > Key: HUDI-1205 > URL: https://issues.apache.org/jira/browse/HUDI-1205 > Project: Apache Hudi > Issue Type: Bug >Reporter: Yanjia Gary Li >Priority: Major > > When scanning the log file, if the log file(or log file group) is larger than > 2GB, serialization will fail because Hudi uses Integer to store size in byte > for the log file. The maximum integer representing bytes is 2GB. > Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: > org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784 > Serialization trace: > orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload) > data (org.apache.hudi.common.model.HoodieRecord) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) > at > org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107) > at > org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81) > at > 
org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217) > at > org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211) > at > org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207) > at > org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168) > at > org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55) > at > org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.ClassNotFoundException: > 
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784 > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154) > ... 31 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1205) Serialization fail when log file is larger than 2GB
[ https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230385#comment-17230385 ] Balaji Varadarajan commented on HUDI-1205: -- This is likely fixed as part of [https://github.com/apache/hudi/commit/b335459c805748815ccc858ff1a9ef4cd830da8c] Ref: [https://github.com/apache/hudi/issues/2237] Will close once it is confirmed. > Serialization fail when log file is larger than 2GB > --- > > Key: HUDI-1205 > URL: https://issues.apache.org/jira/browse/HUDI-1205 > Project: Apache Hudi > Issue Type: Bug >Reporter: Yanjia Gary Li >Priority: Major > > When scanning the log file, if the log file(or log file group) is larger than > 2GB, serialization will fail because Hudi uses Integer to store size in byte > for the log file. The maximum integer representing bytes is 2GB. > Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: > org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784 > Serialization trace: > orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload) > data (org.apache.hudi.common.model.HoodieRecord) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) > at > 
org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107) > at > org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81) > at > org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217) > at > org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211) > at > org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207) > at > org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168) > at > org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55) > at > org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.ClassNotFoundException: > org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784 > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154) > ... 31 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
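The failure mode HUDI-1205 describes, a byte count held in an Integer wrapping past 2GB, can be sketched in a few lines. The class and method names below are illustrative, not Hudi's actual code; the real fix is to widen the size field to a long.

```java
// Sketch: why an int-sized byte count breaks for files at or beyond 2 GiB.
// All names here are illustrative; the actual fix widens the field to long.
public class LogSizeOverflow {

    // 2 GiB expressed as a long: 2^31 bytes, one past Integer.MAX_VALUE.
    static final long TWO_GIB = 1L << 31;

    // Narrowing a long byte count into an int, as the buggy code effectively did.
    static int narrowedSize(long fileLenBytes) {
        return (int) fileLenBytes; // wraps negative for lengths >= 2^31
    }

    public static void main(String[] args) {
        long bigLogFile = TWO_GIB + 100; // a log file slightly over 2 GiB
        System.out.println(Integer.MAX_VALUE);      // largest representable size: 2 GiB - 1 byte
        System.out.println(narrowedSize(bigLogFile)); // negative: the overflow the issue hits
    }
}
```

Any downstream code that trusts the narrowed value (buffer sizing, offsets) then misbehaves, which is consistent with the corrupted-deserialization trace above.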
[jira] [Updated] (HUDI-1383) Incorrect partitions getting hive synced
[ https://issues.apache.org/jira/browse/HUDI-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1383: - Status: Open (was: New) > Incorrect partitions getting hive synced > > > Key: HUDI-1383 > URL: https://issues.apache.org/jira/browse/HUDI-1383 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.6.1 > > > https://github.com/apache/hudi/issues/2234 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1383) Incorrect partitions getting hive synced
Balaji Varadarajan created HUDI-1383: Summary: Incorrect partitions getting hive synced Key: HUDI-1383 URL: https://issues.apache.org/jira/browse/HUDI-1383 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Reporter: Balaji Varadarajan Fix For: 0.6.1 https://github.com/apache/hudi/issues/2234 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1381) Schedule compaction based on time elapsed
Balaji Varadarajan created HUDI-1381: Summary: Schedule compaction based on time elapsed Key: HUDI-1381 URL: https://issues.apache.org/jira/browse/HUDI-1381 Project: Apache Hudi Issue Type: New Feature Components: Compaction Reporter: Balaji Varadarajan Fix For: 0.7.0 GH : [https://github.com/apache/hudi/issues/2229] It would be helpful to introduce a configuration to schedule compaction based on the time elapsed since the last scheduled compaction. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1381) Schedule compaction based on time elapsed
[ https://issues.apache.org/jira/browse/HUDI-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1381: - Status: Open (was: New) > Schedule compaction based on time elapsed > -- > > Key: HUDI-1381 > URL: https://issues.apache.org/jira/browse/HUDI-1381 > Project: Apache Hudi > Issue Type: New Feature > Components: Compaction >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > GH : [https://github.com/apache/hudi/issues/2229] > It would be helpful to introduce a configuration to schedule compaction based > on the time elapsed since the last scheduled compaction. -- This message was sent by Atlassian Jira (v8.3.4#803005)
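A minimal sketch of the trigger HUDI-1381 proposes: schedule compaction once a configured amount of time has passed since the last scheduled compaction. The method name and the epoch-millisecond convention below are illustrative assumptions; Hudi's actual instants are formatted timestamps and the config key would be defined by the implementation.

```java
// Sketch of a time-elapsed compaction trigger (HUDI-1381). Names and the
// epoch-millis convention are illustrative, not Hudi's actual API.
public class TimeBasedCompactionTrigger {

    // True when at least maxElapsedMs of wall-clock time has passed since the
    // last scheduled compaction, so a new compaction plan should be created.
    static boolean shouldSchedule(long lastScheduledMs, long nowMs, long maxElapsedMs) {
        return nowMs - lastScheduledMs >= maxElapsedMs;
    }

    public static void main(String[] args) {
        long oneHourMs = 60L * 60 * 1000;
        // Last compaction scheduled two hours ago, threshold one hour: trigger.
        System.out.println(shouldSchedule(0L, 2 * oneHourMs, oneHourMs)); // true
    }
}
```

In practice such a check would complement the existing commits-based trigger, so compaction fires on whichever condition is met first.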
[jira] [Updated] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
[ https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1309: - Status: Open (was: New) > Listing Metadata unreadable in S3 as the log block is deemed corrupted > -- > > Key: HUDI-1309 > URL: https://issues.apache.org/jira/browse/HUDI-1309 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > > When running metadata list-partitions CLI command, I am seeing the below > messages and the partition list is empty. Was expecting 10K partitions. > > {code:java} > 36589 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning > log file > HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} > 36590 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block > in file > HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} with block size(3723305) running past EOF > 36684 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Log > HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} has a corrupted block at 14 > 44515 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block > in > 
HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} starts at 3723319 > 44566 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a > corrupt block in > s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 > 44567 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1365) Listing leaf files and directories is very Slow
[ https://issues.apache.org/jira/browse/HUDI-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225477#comment-17225477 ] Balaji Varadarajan commented on HUDI-1365: -- https://github.com/apache/hudi/commit/9a1f698eef103adadbf7a1bf7b5eb94fb84e > Listing leaf files and directories is very Slow > --- > > Key: HUDI-1365 > URL: https://issues.apache.org/jira/browse/HUDI-1365 > Project: Apache Hudi > Issue Type: Bug >Reporter: Selvaraj periyasamy >Priority: Major > Attachments: image-2020-11-01-01-11-11-561.png, image.png > > > I am using Hudi 0.5.0. I took 0.5.0 and used the changes for > HoodieROTablePathFilter from HUDI-1144. Even though it caches, I am seeing > only 46 directories cached in 1 min. Due to this, my job takes a lot of time to > write because I have 6 months' worth of hourly partitions. Is there a way to > speed up? I am running it in a production cluster and have enough Vcores > available to process. > > HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString()); > if (null == metaClient) > { metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), > true); metaClientCache.put(baseDir.toString(), metaClient); } > HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient, > > metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(), > fs.listStatus(folder)); > List<HoodieDataFile> latestFiles = > fsView.getLatestDataFiles().collect(Collectors.toList()); > // populate the cache > if (!hoodiePathCache.containsKey(folder.toString())) > { hoodiePathCache.put(folder.toString(), new HashSet<>()); } > LOG.info("Custom Code : Based on hoodie metadata from base path: " + > baseDir.toString() + ", caching " + latestFiles.size() > + " files under " + folder); > for (HoodieDataFile lfile : latestFiles) > { hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath())); } > > > > Sample Logs here. I have attached the log file as well. 
> > 20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/08, #FileGroups=2 > 20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: > hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files > under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08 > 20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at > threw an IOException!! java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/09, #FileGroups=2 > 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: > hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files > under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09 > 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/10, #FileGroups=3 > 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: > hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files > under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10 > 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at > threw an IOException!! 
java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at > threw an IOException!! java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/11, #FileGroups=2 > 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: >
[jira] [Commented] (HUDI-1365) Listing leaf files and directories is very Slow
[ https://issues.apache.org/jira/browse/HUDI-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224750#comment-17224750 ] Balaji Varadarajan commented on HUDI-1365: -- [~Selvaraj.periyasamy1983]: 0.5.0 is a very old version of Hudi. You should try moving to later versions, as they include other improvements such as the removal of "rename" operations. W.r.t. your performance, I see a lot of WARN-level logs with exceptions getting caught. I am wondering if this is due to a misconfiguration that is slowing your query. On a different note, we are going to support a feature in the next release which would avoid listing data partitions completely (https://issues.apache.org/jira/browse/HUDI-1292). > Listing leaf files and directories is very Slow > --- > > Key: HUDI-1365 > URL: https://issues.apache.org/jira/browse/HUDI-1365 > Project: Apache Hudi > Issue Type: Bug >Reporter: Selvaraj periyasamy >Priority: Major > Attachments: Log.txt, image-2020-11-01-01-11-11-561.png > > > I am using Hudi 0.5.0. I took 0.5.0 and used the changes for > HoodieROTablePathFilter from HUDI-1144. Even though it caches, I am seeing > only 46 directories cached in 1 min. Due to this, my job takes a lot of time to > write because I have 6 months' worth of hourly partitions. Is there a way to > speed up? I am running it in a production cluster and have enough Vcores > available to process. 
> > HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString()); > if (null == metaClient) > { metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), > true); metaClientCache.put(baseDir.toString(), metaClient); } > HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient, > > metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(), > fs.listStatus(folder)); > List<HoodieDataFile> latestFiles = > fsView.getLatestDataFiles().collect(Collectors.toList()); > // populate the cache > if (!hoodiePathCache.containsKey(folder.toString())) > { hoodiePathCache.put(folder.toString(), new HashSet<>()); } > LOG.info("Custom Code : Based on hoodie metadata from base path: " + > baseDir.toString() + ", caching " + latestFiles.size() > + " files under " + folder); > for (HoodieDataFile lfile : latestFiles) > { hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath())); } > > > > Sample Logs here. I have attached the log file as well. > > 20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/08, #FileGroups=2 > 20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: > hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files > under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08 > 20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at > [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! 
> java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/09, #FileGroups=2 > 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: > hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files > under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09 > 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for > partition :20200919/10, #FileGroups=3 > 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: > NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0 > 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on > hoodie metadata from base path: > hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files > under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10 > 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at > [http://sl73caehmpc1009.visa.com:9292/kms/v1/] threw an IOException!! > java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at > [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! >
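The snippet quoted in this issue memoizes the latest data files per partition folder so later path-filter checks avoid re-listing the file system. The caching pattern it relies on can be shown self-contained; the class and method names below are illustrative, not Hudi's actual `HoodieROTablePathFilter` API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the per-folder memoization used above: list a partition folder
// once, cache the latest file paths, then answer accept() from the cache.
public class PathFilterCache {

    private final Map<String, Set<String>> pathCache = new HashMap<>();

    // Called once per folder after the (expensive) file-system listing.
    void cacheFolder(String folder, List<String> latestFilePaths) {
        pathCache.computeIfAbsent(folder, f -> new HashSet<>()).addAll(latestFilePaths);
    }

    // O(1) membership check; returns false for folders that were never listed.
    boolean accept(String folder, String path) {
        Set<String> cached = pathCache.get(folder);
        return cached != null && cached.contains(path);
    }
}
```

The slowness reported here is the cost of the first listing per folder; the cache only helps on repeat visits, which is why a table with six months of hourly partitions still pays thousands of listings up front.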
[jira] [Created] (HUDI-1368) Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2
Balaji Varadarajan created HUDI-1368: Summary: Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2 Key: HUDI-1368 URL: https://issues.apache.org/jira/browse/HUDI-1368 Project: Apache Hudi Issue Type: New Feature Components: Spark Integration Reporter: Balaji Varadarajan Fix For: 0.7.0 https://github.com/apache/hudi/issues/2180 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1368) Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2
[ https://issues.apache.org/jira/browse/HUDI-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1368: - Status: Open (was: New) > Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2 > - > > Key: HUDI-1368 > URL: https://issues.apache.org/jira/browse/HUDI-1368 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Balaji Varadarajan >Priority: Major > Labels: adls > Fix For: 0.7.0 > > > https://github.com/apache/hudi/issues/2180 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys
[ https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1363: - Status: Open (was: New) > Provide Option to drop columns after they are used to generate partition or > record keys > --- > > Key: HUDI-1363 > URL: https://issues.apache.org/jira/browse/HUDI-1363 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > Context: https://github.com/apache/hudi/issues/2213 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys
Balaji Varadarajan created HUDI-1363: Summary: Provide Option to drop columns after they are used to generate partition or record keys Key: HUDI-1363 URL: https://issues.apache.org/jira/browse/HUDI-1363 Project: Apache Hudi Issue Type: New Feature Components: Writer Core Reporter: Balaji Varadarajan Fix For: 0.7.0 Context: https://github.com/apache/hudi/issues/2213 -- This message was sent by Atlassian Jira (v8.3.4#803005)
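A hedged sketch of what the option requested in HUDI-1363 could do: derive the record/partition key from the incoming row first, then drop the key-source columns before the row is written. The Map-as-row shape and all names below are illustrative stand-ins for a Spark Row or Avro record, not Hudi's key-generator API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Sketch of "drop columns after key generation" (HUDI-1363). Illustrative only.
public class DropKeySourceColumns {

    // Returns a copy of the row with the key-source columns removed; the key
    // itself must have been derived before this runs. The input row is untouched.
    static Map<String, Object> dropAfterKeyGen(Map<String, Object> row,
                                               Collection<String> keySourceCols) {
        Map<String, Object> out = new HashMap<>(row);
        out.keySet().removeAll(keySourceCols);
        return out;
    }
}
```

The ordering is the whole point of the feature: the columns must survive long enough to feed the key generator, and only then be dropped from the written payload.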
[jira] [Assigned] (HUDI-1358) Memory Leak in HoodieLogFormatWriter
[ https://issues.apache.org/jira/browse/HUDI-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1358: Assignee: Balaji Varadarajan > Memory Leak in HoodieLogFormatWriter > > > Key: HUDI-1358 > URL: https://issues.apache.org/jira/browse/HUDI-1358 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > > https://github.com/apache/hudi/issues/2215 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1358) Memory Leak in HoodieLogFormatWriter
Balaji Varadarajan created HUDI-1358: Summary: Memory Leak in HoodieLogFormatWriter Key: HUDI-1358 URL: https://issues.apache.org/jira/browse/HUDI-1358 Project: Apache Hudi Issue Type: Bug Components: Writer Core Reporter: Balaji Varadarajan https://github.com/apache/hudi/issues/2215 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1358) Memory Leak in HoodieLogFormatWriter
[ https://issues.apache.org/jira/browse/HUDI-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1358: - Status: Open (was: New) > Memory Leak in HoodieLogFormatWriter > > > Key: HUDI-1358 > URL: https://issues.apache.org/jira/browse/HUDI-1358 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > > https://github.com/apache/hudi/issues/2215 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1350) Support Partition level delete API in HUDI on top of Insert Overwrite
[ https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219950#comment-17219950 ] Balaji Varadarajan commented on HUDI-1350: -- Yes, [~309637554]: You can change the API to take in a list of partitions. At the Spark datasource and CLI levels, you can accept a path glob for the partitions to be deleted. > Support Partition level delete API in HUDI on top of Insert Overwrite > - > > Key: HUDI-1350 > URL: https://issues.apache.org/jira/browse/HUDI-1350 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
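The suggestion in this comment (accept a path glob at the datasource/CLI layer and expand it into the concrete partition list handed to the delete API) can be sketched as follows. The glob dialect, supporting only '*' matching within one path segment, and all names are illustrative assumptions, not Hudi's actual API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: expand a partition-path glob into the partitions a delete API
// would receive (HUDI-1350 discussion). Illustrative names and glob dialect.
public class PartitionDeleteResolver {

    // '*' matches any run of characters within a single path segment.
    static List<String> resolvePartitions(List<String> allPartitions, String glob) {
        String regex = glob.replace(".", "\\.").replace("*", "[^/]*");
        return allPartitions.stream()
                .filter(p -> p.matches(regex))
                .collect(Collectors.toList());
    }
}
```

With this split, the core write API stays a plain list-of-partitions call, and glob handling remains a convenience of the Spark datasource and CLI front ends.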
[jira] [Commented] (HUDI-1340) Not able to query real time table when rows contains nested elements
[ https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219370#comment-17219370 ] Balaji Varadarajan commented on HUDI-1340: -- This is likely related to parquet (serde and related library) versions difference between the write and read (parquet-hive) side. > Not able to query real time table when rows contains nested elements > > > Key: HUDI-1340 > URL: https://issues.apache.org/jira/browse/HUDI-1340 > Project: Apache Hudi > Issue Type: Bug >Reporter: Bharat Dighe >Priority: Major > Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, > users3.avro, users4.avro, users5.avro > > > AVRO schema: Attached > Script to generate sample data: attached > Sample data attached > == > the schema as nested elements, here is the output from hive > {code:java} > CREATE EXTERNAL TABLE `users_mor_rt`( > `_hoodie_commit_time` string, > `_hoodie_commit_seqno` string, > `_hoodie_record_key` string, > `_hoodie_partition_path` string, > `_hoodie_file_name` string, > `name` string, > `userid` int, > `datehired` string, > `meta` struct, > `experience` > struct>>) > PARTITIONED BY ( > `role` string) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION > 'hdfs://namenode:8020/tmp/hudi_repair_order_mor' > TBLPROPERTIES ( > 'last_commit_time_sync'='20201011190954', > 'transient_lastDdlTime'='1602442906') > {code} > scala code: > {code:java} > import java.io.File > import org.apache.hudi.QuickstartUtils._ > import org.apache.spark.sql.SaveMode._ > import org.apache.avro.Schema > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val tableName = "users_mor" > // val basePath = 
"hdfs:///tmp/hudi_repair_order_mor" > val basePath = "hdfs:///tmp/hudi_repair_order_mor" > // Insert Data > /// local not hdfs !!! > //val schema = new Schema.Parser().parse(new > File("/var/hoodie/ws/docker/demo/data/user/user.avsc")) > def updateHudi( num:String, op:String) = { > val path = "hdfs:///var/demo/data/user/users" + num + ".avro" > println( path ); > val avdf2 = new org.apache.spark.sql.SQLContext(sc).read.format("avro"). > // option("avroSchema", schema.toString). > load(path) > avdf2.select("name").show(false) > avdf2.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(OPERATION_OPT_KEY,op). > option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // > default:COPY_ON_WRITE, MERGE_ON_READ > option(KEYGENERATOR_CLASS_OPT_KEY, > "org.apache.hudi.keygen.ComplexKeyGenerator"). > option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime"). // dedup > option(RECORDKEY_FIELD_OPT_KEY, "userId"). // key > option(PARTITIONPATH_FIELD_OPT_KEY, "role"). > option(TABLE_NAME, tableName). > option("hoodie.compact.inline", false). > option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true"). > option(HIVE_SYNC_ENABLED_OPT_KEY, "true"). > option(HIVE_TABLE_OPT_KEY, tableName). > option(HIVE_USER_OPT_KEY, "hive"). > option(HIVE_PASS_OPT_KEY, "hive"). > option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1"). > option(HIVE_PARTITION_FIELDS_OPT_KEY, "role"). > option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor"). > option("hoodie.datasource.hive_sync.assume_date_partitioning", > "false"). > mode(Append). 
> save(basePath) > spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, > _hoodie_partition_path, experience.companies[0] from " + tableName + > "_rt").show() > spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, > _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + > "_ro").show() > } > updateHudi("1", "bulkinsert") > updateHudi("2", "upsert") > updateHudi("3", "upsert") > updateHudi("4", "upsert") > {code} > If nested fields are not included, it works fine > {code} > scala> spark.sql("select name from users_mor_rt"); > res19: org.apache.spark.sql.DataFrame = [name: string] > scala> spark.sql("select name from users_mor_rt").show(); > +-+ > | name| > +-+ > |engg3| > |engg1_new| > |engg2_new| > | mgr1| > | mgr2| > | devops1| > | devops2| > +-+ > {code} > But fails when I include nested field 'experience' > {code} > scala> spark.sql("select
[jira] [Commented] (HUDI-1340) Not able to query real time table when rows contains nested elements
[ https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216913#comment-17216913 ] Balaji Varadarajan commented on HUDI-1340: -- [~bdighe]: Did you use --conf spark.sql.hive.convertMetastoreParquet=false when you started your spark-shell where you are running the query ? https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whydowehavetoset2differentwaysofconfiguringSparktoworkwithHudi? > Not able to query real time table when rows contains nested elements > > > Key: HUDI-1340 > URL: https://issues.apache.org/jira/browse/HUDI-1340 > Project: Apache Hudi > Issue Type: Bug >Reporter: Bharat Dighe >Priority: Major > Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, > users3.avro, users4.avro, users5.avro > > > AVRO schema: Attached > Script to generate sample data: attached > Sample data attached > == > the schema as nested elements, here is the output from hive > {code:java} > CREATE EXTERNAL TABLE `users_mor_rt`( > `_hoodie_commit_time` string, > `_hoodie_commit_seqno` string, > `_hoodie_record_key` string, > `_hoodie_partition_path` string, > `_hoodie_file_name` string, > `name` string, > `userid` int, > `datehired` string, > `meta` struct, > `experience` > struct>>) > PARTITIONED BY ( > `role` string) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION > 'hdfs://namenode:8020/tmp/hudi_repair_order_mor' > TBLPROPERTIES ( > 'last_commit_time_sync'='20201011190954', > 'transient_lastDdlTime'='1602442906') > {code} > scala code: > {code:java} > import java.io.File > import org.apache.hudi.QuickstartUtils._ > import org.apache.spark.sql.SaveMode._ > import org.apache.avro.Schema > import org.apache.hudi.DataSourceReadOptions._ > import 
org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val tableName = "users_mor" > // val basePath = "hdfs:///tmp/hudi_repair_order_mor" > val basePath = "hdfs:///tmp/hudi_repair_order_mor" > // Insert Data > /// local not hdfs !!! > //val schema = new Schema.Parser().parse(new > File("/var/hoodie/ws/docker/demo/data/user/user.avsc")) > def updateHudi( num:String, op:String) = { > val path = "hdfs:///var/demo/data/user/users" + num + ".avro" > println( path ); > val avdf2 = new org.apache.spark.sql.SQLContext(sc).read.format("avro"). > // option("avroSchema", schema.toString). > load(path) > avdf2.select("name").show(false) > avdf2.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(OPERATION_OPT_KEY,op). > option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // > default:COPY_ON_WRITE, MERGE_ON_READ > option(KEYGENERATOR_CLASS_OPT_KEY, > "org.apache.hudi.keygen.ComplexKeyGenerator"). > option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime"). // dedup > option(RECORDKEY_FIELD_OPT_KEY, "userId"). // key > option(PARTITIONPATH_FIELD_OPT_KEY, "role"). > option(TABLE_NAME, tableName). > option("hoodie.compact.inline", false). > option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true"). > option(HIVE_SYNC_ENABLED_OPT_KEY, "true"). > option(HIVE_TABLE_OPT_KEY, tableName). > option(HIVE_USER_OPT_KEY, "hive"). > option(HIVE_PASS_OPT_KEY, "hive"). > option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1"). > option(HIVE_PARTITION_FIELDS_OPT_KEY, "role"). > option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor"). > option("hoodie.datasource.hive_sync.assume_date_partitioning", > "false"). > mode(Append). 
> save(basePath) > spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, > _hoodie_partition_path, experience.companies[0] from " + tableName + > "_rt").show() > spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, > _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + > "_ro").show() > } > updateHudi("1", "bulkinsert") > updateHudi("2", "upsert") > updateHudi("3", "upsert") > updateHudi("4", "upsert") > {code} > If nested fields are not included, it works fine > {code} > scala> spark.sql("select name from users_mor_rt"); > res19: org.apache.spark.sql.DataFrame = [name: string] > scala> spark.sql("select name from users_mor_rt").show(); > +-+ > | name| > +-+ > |engg3| > |engg1_new| > |engg2_new| > | mgr1| > | mgr2| > |
[jira] [Updated] (HUDI-1340) Not able to query real time table when rows contains nested elements
[ https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1340: - Status: Open (was: New)

> Not able to query real time table when rows contain nested elements
>
> Key: HUDI-1340
> URL: https://issues.apache.org/jira/browse/HUDI-1340
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Bharat Dighe
> Priority: Major
> Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, users3.avro, users4.avro, users5.avro
>
> AVRO schema: attached. Script to generate sample data: attached. Sample data: attached.
>
> The schema has nested elements; here is the table definition from Hive:
> {code:java}
> CREATE EXTERNAL TABLE `users_mor_rt`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `name` string,
>   `userid` int,
>   `datehired` string,
>   `meta` struct,
>   `experience` struct>>)
> PARTITIONED BY (`role` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 'hdfs://namenode:8020/tmp/hudi_repair_order_mor'
> TBLPROPERTIES ('last_commit_time_sync'='20201011190954', 'transient_lastDdlTime'='1602442906')
> {code}
> Scala code:
> {code:java}
> import java.io.File
> import org.apache.hudi.QuickstartUtils._
> import org.apache.spark.sql.SaveMode._
> import org.apache.avro.Schema
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
>
> val tableName = "users_mor"
> val basePath = "hdfs:///tmp/hudi_repair_order_mor"
>
> // Insert data. Note: the schema file is local, not on HDFS!
> // val schema = new Schema.Parser().parse(new File("/var/hoodie/ws/docker/demo/data/user/user.avsc"))
>
> def updateHudi(num: String, op: String) = {
>   val path = "hdfs:///var/demo/data/user/users" + num + ".avro"
>   println(path)
>   val avdf2 = new org.apache.spark.sql.SQLContext(sc).read.format("avro").
>     // option("avroSchema", schema.toString).
>     load(path)
>   avdf2.select("name").show(false)
>   avdf2.write.format("hudi").
>     options(getQuickstartWriteConfigs).
>     option(OPERATION_OPT_KEY, op).
>     option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // default: COPY_ON_WRITE; here MERGE_ON_READ
>     option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
>     option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime"). // dedup
>     option(RECORDKEY_FIELD_OPT_KEY, "userId"). // key
>     option(PARTITIONPATH_FIELD_OPT_KEY, "role").
>     option(TABLE_NAME, tableName).
>     option("hoodie.compact.inline", false).
>     option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
>     option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
>     option(HIVE_TABLE_OPT_KEY, tableName).
>     option(HIVE_USER_OPT_KEY, "hive").
>     option(HIVE_PASS_OPT_KEY, "hive").
>     option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1").
>     option(HIVE_PARTITION_FIELDS_OPT_KEY, "role").
>     option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
>     option("hoodie.datasource.hive_sync.assume_date_partitioning", "false").
>     mode(Append).
>     save(basePath)
>   spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, experience.companies[0] from " + tableName + "_rt").show()
>   spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + "_ro").show()
> }
>
> updateHudi("1", "bulkinsert")
> updateHudi("2", "upsert")
> updateHudi("3", "upsert")
> updateHudi("4", "upsert")
> {code}
> If nested fields are not included, it works fine:
> {code}
> scala> spark.sql("select name from users_mor_rt")
> res19: org.apache.spark.sql.DataFrame = [name: string]
>
> scala> spark.sql("select name from users_mor_rt").show()
> +---------+
> |     name|
> +---------+
> |    engg3|
> |engg1_new|
> |engg2_new|
> |     mgr1|
> |     mgr2|
> |  devops1|
> |  devops2|
> +---------+
> {code}
> But it fails when the nested field 'experience' is included:
> {code}
> scala> spark.sql("select name, experience from users_mor_rt").show()
> 20/10/11 19:53:58 ERROR executor.Executor: Exception in task 0.0 in stage 147.0 (TID 153)
> {code}
[jira] [Commented] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner
[ https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215226#comment-17215226 ] Balaji Varadarajan commented on HUDI-845: - Yes [~309637554], this ticket is for tracking general concurrent writes. Supporting partition-level concurrency could be the first phase of the implementation, so we might have to do that first. > Allow parallel writing and move the pending rollback work into cleaner > -- > > Key: HUDI-845 > URL: https://issues.apache.org/jira/browse/HUDI-845 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Vinoth Chandar > Assignee: Balaji Varadarajan > Priority: Blocker > Labels: help-requested > Fix For: 0.7.0 > > > Things to think about: > * Commit time has to be unique across writers. > * Parallel writers can finish commits out of order, i.e. c2 commits before c1. > * MOR log blocks fence uncommitted data. > * Cleaner should loudly complain if it cannot finish cleaning up partial writes. > > P.S.: think about what is left for the general case: log files may have a different order, inserts may violate the uniqueness constraint. -- This message was sent by Atlassian Jira (v8.3.4#803005)
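The first bullet in the ticket above, unique commit times across writers, can be sketched as follows. This is an illustrative, stdlib-only Python model (not Hudi code, which is Java): a process-local allocator that hands out strictly increasing instant times in the second-granularity `yyyyMMddHHmmss` format Hudi has historically used; a real multi-process deployment would need an external lock or coordination service.

```python
import threading
from datetime import datetime, timedelta

_lock = threading.Lock()
_last_instant = datetime.min

def next_instant_time() -> str:
    """Hand out commit ("instant") times that are strictly increasing even
    when several writers ask concurrently, by bumping colliding requests
    forward one second. Illustrative sketch of the uniqueness requirement
    only; names and format are assumptions, not Hudi's implementation."""
    global _last_instant
    with _lock:
        now = datetime.utcnow().replace(microsecond=0)
        if now <= _last_instant:
            # Collision with an earlier allocation in the same second.
            now = _last_instant + timedelta(seconds=1)
        _last_instant = now
        return now.strftime("%Y%m%d%H%M%S")

# Even back-to-back calls in the same second get distinct, ordered instants.
print(next_instant_time(), next_instant_time(), next_instant_time())
```

Out-of-order completion (c2 finishing before c1) is a separate problem this sketch does not address; it only guarantees that no two writers share an instant time.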
[jira] [Updated] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion
[ https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1343: - Fix Version/s: 0.7.0 > Add standard schema postprocessor which would rewrite the schema using > spark-avro conversion > > > Key: HUDI-1343 > URL: https://issues.apache.org/jira/browse/HUDI-1343 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > When we use Transformer, the final Schema which we use to convert avro record > to bytes is auto generated by spark. This could be different (due to the way > Avro treats it) from the target schema that is being used to write (as the > target schema could be coming from Schema Registry). > > For example : > Schema generated by spark-avro when converting Row to avro > { > "type" : "record", > "name" : "hoodie_source", > "namespace" : "hoodie.source", > "fields" : [ { > "name" : "_ts_ms", > "type" : [ "long", "null" ] > }, { > "name" : "_op", > "type" : "string" > }, { > "name" : "inc_id", > "type" : "int" > }, { > "name" : "year", > "type" : [ "int", "null" ] > }, { > "name" : "violation_desc", > "type" : [ "string", "null" ] > }, { > "name" : "violation_code", > "type" : [ "string", "null" ] > }, { > "name" : "case_individual_id", > "type" : [ "int", "null" ] > }, { > "name" : "flag", > "type" : [ "string", "null" ] > }, { > "name" : "last_modified_ts", > "type" : "long" > } ] > } > > is not compatible with the Avro Schema: > > { > "type" : "record", > "name" : "formatted_debezium_payload", > "fields" : [ { > "name" : "_ts_ms", > "type" : [ "null", "long" ], > "default" : null > }, { > "name" : "_op", > "type" : "string", > "default" : null > }, { > "name" : "inc_id", > "type" : "int", > "default" : null > }, { > "name" : "year", > "type" : [ "null", "int" ], > "default" : null > }, { > "name" : "violation_desc", > "type" : [ "null", "string" ], > "default" : null > }, { > 
"name" : "violation_code", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "case_individual_id", > "type" : [ "null", "int" ], > "default" : null > }, { > "name" : "flag", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "last_modified_ts", > "type" : "long", > "default" : null > } ] > } > > Note that the union type order is different for individual fields: > "type" : [ "null", "string" ] vs "type" : [ "string", "null" ]. > Unexpectedly, Avro decoding fails when bytes written with the first schema are > read using the second schema. > > One way to fix this is to use the configured target schema when generating record > bytes, but this is not easy without breaking the record payload constructor API > used by DeltaStreamer. > The other option is to apply a post-processor on the target schema to make it > consistent with the Transformer-generated records. > > This ticket is to use the latter approach of creating a standard schema > post-processor and adding it by default when a Transformer is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion
Balaji Varadarajan created HUDI-1343: Summary: Add standard schema postprocessor which would rewrite the schema using spark-avro conversion Key: HUDI-1343 URL: https://issues.apache.org/jira/browse/HUDI-1343 Project: Apache Hudi Issue Type: Improvement Components: DeltaStreamer Reporter: Balaji Varadarajan

When we use a Transformer, the final schema used to convert the Avro record to bytes is auto-generated by Spark. This can differ (due to the way Avro treats unions) from the target schema being used to write, as the target schema could be coming from the Schema Registry.

For example, the schema generated by spark-avro when converting a Row to Avro:

{code}
{
  "type" : "record",
  "name" : "hoodie_source",
  "namespace" : "hoodie.source",
  "fields" : [
    { "name" : "_ts_ms", "type" : [ "long", "null" ] },
    { "name" : "_op", "type" : "string" },
    { "name" : "inc_id", "type" : "int" },
    { "name" : "year", "type" : [ "int", "null" ] },
    { "name" : "violation_desc", "type" : [ "string", "null" ] },
    { "name" : "violation_code", "type" : [ "string", "null" ] },
    { "name" : "case_individual_id", "type" : [ "int", "null" ] },
    { "name" : "flag", "type" : [ "string", "null" ] },
    { "name" : "last_modified_ts", "type" : "long" }
  ]
}
{code}

is not compatible with the Avro schema:

{code}
{
  "type" : "record",
  "name" : "formatted_debezium_payload",
  "fields" : [
    { "name" : "_ts_ms", "type" : [ "null", "long" ], "default" : null },
    { "name" : "_op", "type" : "string", "default" : null },
    { "name" : "inc_id", "type" : "int", "default" : null },
    { "name" : "year", "type" : [ "null", "int" ], "default" : null },
    { "name" : "violation_desc", "type" : [ "null", "string" ], "default" : null },
    { "name" : "violation_code", "type" : [ "null", "string" ], "default" : null },
    { "name" : "case_individual_id", "type" : [ "null", "int" ], "default" : null },
    { "name" : "flag", "type" : [ "null", "string" ], "default" : null },
    { "name" : "last_modified_ts", "type" : "long", "default" : null }
  ]
}
{code}

Note that the union type order is different for individual fields: "type" : [ "null", "string" ] vs "type" : [ "string", "null" ]. Unexpectedly, Avro decoding fails when bytes written with the first schema are read using the second schema.

One way to fix this is to use the configured target schema when generating record bytes, but this is not easy without breaking the record payload constructor API used by DeltaStreamer. The other option is to apply a post-processor on the target schema to make it consistent with the Transformer-generated records.

This ticket is to use the latter approach of creating a standard schema post-processor and adding it by default when a Transformer is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
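The incompatibility comes from how Avro encodes unions on the wire: the writer emits the zero-based index of the union branch before the value, so reordering `["string", "null"]` to `["null", "string"]` changes what the leading byte means. A simplified, stdlib-only Python model of this encoding (not the real avro library; the helper names are illustrative) shows the failure mode:

```python
import io

def zigzag_encode(n: int) -> bytes:
    # Avro encodes longs as zig-zag varints; this covers the small ints used here.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            break
    return bytes(out)

def zigzag_decode(buf: io.BytesIO) -> int:
    shift, z = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def encode_union(value, branches):
    # Avro writes the index of the union branch, then the value itself.
    if value is None:
        return zigzag_encode(branches.index("null"))
    payload = value.encode("utf-8")
    return zigzag_encode(branches.index("string")) + zigzag_encode(len(payload)) + payload

def decode_union(data, branches):
    buf = io.BytesIO(data)
    branch = branches[zigzag_decode(buf)]
    if branch == "null":
        return None
    length = zigzag_decode(buf)
    return buf.read(length).decode("utf-8")

# Writer used ["string", "null"]; a reader assuming ["null", "string"]
# interprets branch index 0 as "null" and silently drops the value.
data = encode_union("hello", ["string", "null"])
print(decode_union(data, ["string", "null"]))  # round-trips: hello
print(decode_union(data, ["null", "string"]))  # misread as the null branch: None
```

Decoding with the writer's own branch order round-trips; with the reordered union the same bytes select the wrong branch, which is the kind of decoding failure the ticket describes.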
[jira] [Updated] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion
[ https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1343: - Status: Open (was: New) > Add standard schema postprocessor which would rewrite the schema using > spark-avro conversion > > > Key: HUDI-1343 > URL: https://issues.apache.org/jira/browse/HUDI-1343 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Balaji Varadarajan >Priority: Major > > When we use Transformer, the final Schema which we use to convert avro record > to bytes is auto generated by spark. This could be different (due to the way > Avro treats it) from the target schema that is being used to write (as the > target schema could be coming from Schema Registry). > > For example : > Schema generated by spark-avro when converting Row to avro > { > "type" : "record", > "name" : "hoodie_source", > "namespace" : "hoodie.source", > "fields" : [ { > "name" : "_ts_ms", > "type" : [ "long", "null" ] > }, { > "name" : "_op", > "type" : "string" > }, { > "name" : "inc_id", > "type" : "int" > }, { > "name" : "year", > "type" : [ "int", "null" ] > }, { > "name" : "violation_desc", > "type" : [ "string", "null" ] > }, { > "name" : "violation_code", > "type" : [ "string", "null" ] > }, { > "name" : "case_individual_id", > "type" : [ "int", "null" ] > }, { > "name" : "flag", > "type" : [ "string", "null" ] > }, { > "name" : "last_modified_ts", > "type" : "long" > } ] > } > > is not compatible with the Avro Schema: > > { > "type" : "record", > "name" : "formatted_debezium_payload", > "fields" : [ { > "name" : "_ts_ms", > "type" : [ "null", "long" ], > "default" : null > }, { > "name" : "_op", > "type" : "string", > "default" : null > }, { > "name" : "inc_id", > "type" : "int", > "default" : null > }, { > "name" : "year", > "type" : [ "null", "int" ], > "default" : null > }, { > "name" : "violation_desc", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : 
"violation_code", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "case_individual_id", > "type" : [ "null", "int" ], > "default" : null > }, { > "name" : "flag", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "last_modified_ts", > "type" : "long", > "default" : null > } ] > } > > Note that the union type order is different for individual fields: > "type" : [ "null", "string" ] vs "type" : [ "string", "null" ]. > Unexpectedly, Avro decoding fails when bytes written with the first schema are > read using the second schema. > > One way to fix this is to use the configured target schema when generating record > bytes, but this is not easy without breaking the record payload constructor API > used by DeltaStreamer. > The other option is to apply a post-processor on the target schema to make it > consistent with the Transformer-generated records. > > This ticket is to use the latter approach of creating a standard schema > post-processor and adding it by default when a Transformer is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1329) Support async compaction in spark DF write()
[ https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1329: - Status: Open (was: New) > Support async compaction in spark DF write() > > > Key: HUDI-1329 > URL: https://issues.apache.org/jira/browse/HUDI-1329 > Project: Apache Hudi > Issue Type: Improvement > Components: Compaction >Reporter: Balaji Varadarajan >Priority: Major > Fix For: 0.7.0 > > > spark.write().format("hudi").option(operation, "run_compact") to run > compaction > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1329) Support async compaction in spark DF write()
Balaji Varadarajan created HUDI-1329: Summary: Support async compaction in spark DF write() Key: HUDI-1329 URL: https://issues.apache.org/jira/browse/HUDI-1329 Project: Apache Hudi Issue Type: Improvement Components: Compaction Reporter: Balaji Varadarajan Fix For: 0.7.0 spark.write().format("hudi").option(operation, "run_compact") to run compaction -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-898) Need to add Schema parameter to HoodieRecordPayload::preCombine
[ https://issues.apache.org/jira/browse/HUDI-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-898: Status: Open (was: New) > Need to add Schema parameter to HoodieRecordPayload::preCombine > --- > > Key: HUDI-898 > URL: https://issues.apache.org/jira/browse/HUDI-898 > Project: Apache Hudi > Issue Type: Improvement > Components: Common Core > Reporter: Yixue Zhu > Priority: Major > > We are working on Mongo oplog integration with Hudi, to stream Mongo updates > to Hudi tables. > There are 4 Mongo oplog operations we need to handle: CRUD (create, read, > update, delete). > Currently Hudi handles create/read and delete well, but not update, with the existing > preCombine API in the HoodieRecordPayload class. In particular, the update > operation contains a "patch" field, which is extended JSON describing updates > for dot-separated field paths. > We need to pass the Avro schema to the preCombine API for this to work: > even though the BaseAvroPayload constructor accepts a GenericRecord, which has an Avro > schema reference, it materializes the GenericRecord to bytes to support > serialization/deserialization by ExternalSpillableMap. > > Is there any concern/objection to this? In other words, have I overlooked > something? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
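To see what the precombine step needs the schema (or at least record structure) for, here is a hedged Python sketch of the Mongo-oplog-style "patch" merge the ticket describes: applying updates addressed by dot-separated field paths to the current record. Everything here is hypothetical illustration; Hudi's actual payload API is Java and operates on Avro records, not dicts.

```python
def apply_patch(record: dict, patch: dict) -> dict:
    """Merge a Mongo-oplog-style patch of dot-separated field paths into a
    full record, returning the combined record. Hypothetical helper: the
    real preCombine would need the Avro schema to resolve these paths
    against nested record types."""
    # Shallow-copy one level of nesting so the input record is not mutated.
    merged = {k: (dict(v) if isinstance(v, dict) else v) for k, v in record.items()}
    for path, value in patch.items():
        node = merged
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return merged

current = {"userId": 7, "profile": {"city": "SF", "title": "engineer"}}
update = {"profile.title": "manager", "lastModified": 1600000000}
print(apply_patch(current, update))
```

Without structural knowledge of the record, a payload that only sees serialized bytes cannot resolve `profile.title` against the right nested field, which is the motivation for adding the schema parameter.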
[jira] [Assigned] (HUDI-898) Need to add Schema parameter to HoodieRecordPayload::preCombine
[ https://issues.apache.org/jira/browse/HUDI-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-898: --- Assignee: Balaji Varadarajan > Need to add Schema parameter to HoodieRecordPayload::preCombine > --- > > Key: HUDI-898 > URL: https://issues.apache.org/jira/browse/HUDI-898 > Project: Apache Hudi > Issue Type: Improvement > Components: Common Core > Reporter: Yixue Zhu > Assignee: Balaji Varadarajan > Priority: Major > > We are working on Mongo oplog integration with Hudi, to stream Mongo updates > to Hudi tables. > There are 4 Mongo oplog operations we need to handle: CRUD (create, read, > update, delete). > Currently Hudi handles create/read and delete well, but not update, with the existing > preCombine API in the HoodieRecordPayload class. In particular, the update > operation contains a "patch" field, which is extended JSON describing updates > for dot-separated field paths. > We need to pass the Avro schema to the preCombine API for this to work: > even though the BaseAvroPayload constructor accepts a GenericRecord, which has an Avro > schema reference, it materializes the GenericRecord to bytes to support > serialization/deserialization by ExternalSpillableMap. > > Is there any concern/objection to this? In other words, have I overlooked > something? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205435#comment-17205435 ] Balaji Varadarajan commented on HUDI-1308: -- cc [~vinoth] > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Balaji Varadarajan > Assignee: Balaji Varadarajan > Priority: Major > > This is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1311) Writes creating/updating large number of files seeing errors when deleting marker files in S3
Balaji Varadarajan created HUDI-1311: Summary: Writes creating/updating large number of files seeing errors when deleting marker files in S3 Key: HUDI-1311 URL: https://issues.apache.org/jira/browse/HUDI-1311 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan Don't have the exception trace handy. Will add it when I run into this next time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1310) Corruption Block Handling too slow in S3
Balaji Varadarajan created HUDI-1310: Summary: Corruption Block Handling too slow in S3 Key: HUDI-1310 URL: https://issues.apache.org/jira/browse/HUDI-1310 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan

The logic to figure out the next valid starting block offset is too slow when run in S3. I have bolded the log message that takes a long time to appear.

36589 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block in file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Log HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} has a corrupted block at 14
*44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block in* HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} starts at 3723319
44566 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a corrupt block in s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 -- This message was sent by Atlassian Jira (v8.3.4#803005)
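The slow step above is the reader's scan for the next valid block start after a corrupt block. A hedged Python sketch of such a scan, reading large chunks so that a remote store like S3 sees a handful of range requests instead of many tiny reads. The `#HUDI#` marker and function names are illustrative stand-ins, not Hudi's actual implementation:

```python
import io

MAGIC = b"#HUDI#"  # stand-in for the log-block magic sequence (illustrative)

def next_block_offset(stream, start: int, chunk_size: int = 1 << 20) -> int:
    """Scan forward from `start` for the next magic marker, reading
    `chunk_size` bytes at a time. Returns the absolute offset of the
    marker, or -1 if none is found before EOF."""
    stream.seek(start)
    offset = start  # absolute offset of the next unread byte
    tail = b""      # carry-over in case the marker straddles a chunk boundary
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        pos = window.find(MAGIC)
        if pos != -1:
            # window[0] sits at absolute offset (offset - len(tail)).
            return offset - len(tail) + pos
        tail = window[-(len(MAGIC) - 1):]
        offset += len(chunk)

data = b"x" * 10 + MAGIC + b"payload"
print(next_block_offset(io.BytesIO(data), 0))  # -> 10
```

A byte-at-a-time version of the same scan is what becomes painfully slow over S3, since each read turns into a network round trip; batching the reads is the usual fix.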
[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1308: Assignee: Balaji Varadarajan (was: Prashant Wason) > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Balaji Varadarajan > Assignee: Balaji Varadarajan > Priority: Major > > This is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1308: Assignee: Prashant Wason > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Balaji Varadarajan > Assignee: Prashant Wason > Priority: Major > > This is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
[ https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1309: Assignee: Prashant Wason > Listing Metadata unreadable in S3 as the log block is deemed corrupted > -- > > Key: HUDI-1309 > URL: https://issues.apache.org/jira/browse/HUDI-1309 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Prashant Wason >Priority: Major > > When running metadata list-partitions CLI command, I am seeing the below > messages and the partition list is empty. Was expecting 10K partitions. > > 36589 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning > log file > HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} > 36590 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block > in file > HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} with block size(3723305) running past EOF > 36684 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Log > HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} has a corrupted block at 14 > 44515 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block > in > 
HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} starts at 3723319 > 44566 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a > corrupt block in > s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 > 44567 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
Balaji Varadarajan created HUDI-1309: Summary: Listing Metadata unreadable in S3 as the log block is deemed corrupted Key: HUDI-1309 URL: https://issues.apache.org/jira/browse/HUDI-1309 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan

When running the metadata list-partitions CLI command, I am seeing the messages below and the partition list is empty. I was expecting 10K partitions.

36589 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block in file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Log HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} has a corrupted block at 14
44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block in HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} starts at 3723319
44566 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a corrupt block in s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
44567 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1308) Issues found during testing RFC-15
Balaji Varadarajan created HUDI-1308: Summary: Issues found during testing RFC-15 Key: HUDI-1308 URL: https://issues.apache.org/jira/browse/HUDI-1308 Project: Apache Hudi Issue Type: Improvement Components: Writer Core Reporter: Balaji Varadarajan This is an umbrella ticket containing all the issues found during testing RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1308: - Status: Open (was: New) > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Balaji Varadarajan > Priority: Major > > This is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1257) Insert only write operations should preserve duplicate records
[ https://issues.apache.org/jira/browse/HUDI-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201628#comment-17201628 ] Balaji Varadarajan commented on HUDI-1257: -- [~nicholasjiang]: Yes, they are the same. You can mark one of those jiras as a duplicate. > Insert only write operations should preserve duplicate records > -- > > Key: HUDI-1257 > URL: https://issues.apache.org/jira/browse/HUDI-1257 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Balaji Varadarajan > Assignee: Nicholas Jiang > Priority: Major > Fix For: 0.6.1 > > > [https://github.com/apache/hudi/issues/2051] > > ``` > I think the point [@jiegzhan|https://github.com/jiegzhan] pointed out is > reasonable: for the insert operation, we should not update the existing records. > Right now the behavior/result is different when setting a different small file > limit: when it is set to 0, the new inserts will not update the old records > and are written into a new file, but when it is set to another value such as 128M, > the new inserts may update old records lying in small files picked up by the > UpsertPartitioner. > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1290) Implement Debezium avro source for Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1290: - Status: Open (was: New) > Implement Debezium avro source for Delta Streamer > - > > Key: HUDI-1290 > URL: https://issues.apache.org/jira/browse/HUDI-1290 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.6.1 > > > We need to implement transformer and payloads for seamlessly pulling change > logs emitted by debezium in Kafka. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1290) Implement Debezium avro source for Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1290: Assignee: Balaji Varadarajan > Implement Debezium avro source for Delta Streamer > - > > Key: HUDI-1290 > URL: https://issues.apache.org/jira/browse/HUDI-1290 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.6.1 > > > We need to implement transformer and payloads for seamlessly pulling change > logs emitted by debezium in Kafka. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1290) Implement Debezium avro source for Delta Streamer
Balaji Varadarajan created HUDI-1290: Summary: Implement Debezium avro source for Delta Streamer Key: HUDI-1290 URL: https://issues.apache.org/jira/browse/HUDI-1290 Project: Apache Hudi Issue Type: Improvement Components: DeltaStreamer Reporter: Balaji Varadarajan Fix For: 0.6.1 We need to implement transformer and payloads for seamlessly pulling change logs emitted by debezium in Kafka. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1270) NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5
[ https://issues.apache.org/jira/browse/HUDI-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1270: - Status: Open (was: New) > NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5 > --- > > Key: HUDI-1270 > URL: https://issues.apache.org/jira/browse/HUDI-1270 > Project: Apache Hudi > Issue Type: Bug > Reporter: Gary Li > Priority: Major > > Some AWS EMR users are reporting: > java.lang.NoSuchMethodError: > org.apache.spark.sql.execution.datasources.PartitionedFile. > on EMR (Spark-2.4.5-amzn-0) when using the Spark Datasource to query a MOR > table. > [https://github.com/apache/hudi/pull/1848#issuecomment-687392285] > [https://github.com/apache/hudi/issues/2057#issuecomment-685015564] > [~uditme] [~vbalaji] would you guys be able to help? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1270) NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5
[ https://issues.apache.org/jira/browse/HUDI-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195158#comment-17195158 ] Balaji Varadarajan commented on HUDI-1270: -- [~uditme] : Pinging > NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5 > --- > > Key: HUDI-1270 > URL: https://issues.apache.org/jira/browse/HUDI-1270 > Project: Apache Hudi > Issue Type: Bug > Reporter: Gary Li > Priority: Major > > Some AWS EMR users are reporting: > java.lang.NoSuchMethodError: > org.apache.spark.sql.execution.datasources.PartitionedFile. > on EMR (Spark-2.4.5-amzn-0) when using the Spark Datasource to query a MOR > table. > [https://github.com/apache/hudi/pull/1848#issuecomment-687392285] > [https://github.com/apache/hudi/issues/2057#issuecomment-685015564] > [~uditme] [~vbalaji] would you guys be able to help? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1280) Add tool to capture earliest or latest offsets in kafka topics
Balaji Varadarajan created HUDI-1280: Summary: Add tool to capture earliest or latest offsets in kafka topics Key: HUDI-1280 URL: https://issues.apache.org/jira/browse/HUDI-1280 Project: Apache Hudi Issue Type: Improvement Components: DeltaStreamer Reporter: Balaji Varadarajan Fix For: 0.6.1 For bootstrapping cases using spark.write(), we need to capture offsets from the kafka topic and use them as the checkpoint for subsequent reads from Kafka topics. [https://github.com/apache/hudi/issues/1985] We need to build this integration for a smooth transition to deltastreamer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
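A minimal sketch of the capture step described above, assuming a DeltaStreamer-style Kafka checkpoint string of the form `topic,partition:offset,...` (the format used by Hudi's KafkaOffsetGen, to the best of my reading; treat it as an assumption to verify against the Hudi version in use):

```python
def to_kafka_checkpoint(topic: str, offsets: dict[int, int]) -> str:
    """Render per-partition offsets as a Kafka checkpoint string
    ("topic,partition:offset,...") that could seed DeltaStreamer's
    checkpoint after a spark.write() bootstrap. Format is an assumption,
    not taken from the Hudi source."""
    parts = ",".join(f"{p}:{o}" for p, o in sorted(offsets.items()))
    return f"{topic},{parts}"

# Offsets captured from the topic (e.g. via the Kafka consumer API):
print(to_kafka_checkpoint("impressions", {0: 128, 1: 96, 2: 200}))
# -> impressions,0:128,1:96,2:200
```

The tool the ticket asks for would fetch the earliest or latest offset per partition from Kafka and emit exactly this kind of string for the first DeltaStreamer run to pick up.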