[jira] [Commented] (HUDI-146) Impala Support
[ https://issues.apache.org/jira/browse/HUDI-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968043#comment-16968043 ] Yanjia Gary Li commented on HUDI-146: - Hello [~vinoth], Yuanbin finished his internship a few months ago, please assign this ticket to me and I will give it a try. > Impala Support > -- > > Key: HUDI-146 > URL: https://issues.apache.org/jira/browse/HUDI-146 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Vinoth Chandar >Assignee: Yuanbin Cheng >Priority: Major > > [https://github.com/apache/incubator-hudi/issues/179] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-146) Impala Support
[ https://issues.apache.org/jira/browse/HUDI-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968638#comment-16968638 ] Yanjia Gary Li commented on HUDI-146: - [~vinoth] is there any hudi related code in the Hive code base?
[jira] [Created] (HUDI-318) Update Migration Guide to Include Delta Streamer
Yanjia Gary Li created HUDI-318: --- Summary: Update Migration Guide to Include Delta Streamer Key: HUDI-318 URL: https://issues.apache.org/jira/browse/HUDI-318 Project: Apache Hudi (incubating) Issue Type: Improvement Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li [http://hudi.apache.org/migration_guide.html]
[jira] [Created] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
Yanjia Gary Li created HUDI-415: --- Summary: HoodieSparkSqlWriter Commit time not representing the Spark job starting time Key: HUDI-415 URL: https://issues.apache.org/jira/browse/HUDI-415 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li Hudi records the commit time after the first action completes. If there is a heavy transformation before isEmpty(), then the commit time can be inaccurate. {code:java} if (hoodieRecords.isEmpty()) { log.info("new batch has no new records, skipping...") return (true, common.util.Option.empty()) } commitTime = client.startCommit() writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, commitTime, operation) {code} For example, if I start the Spark job at 20190101 but *isEmpty()* runs for 2 hours, the commit time in the .hoodie folder will be 201901010*2*00. If I then use that commit time to ingest data starting from 201901010200 (from HDFS, not using deltastreamer), I will miss 2 hours of data. Is this setup intended? Can we move the commit time before isEmpty()?
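The fix the ticket asks for is an ordering change: reserve the commit instant before the potentially hours-long isEmpty() action runs, so the instant reflects the job's start rather than the end of a long transformation. A minimal Python sketch of that ordering (write_batch and fake_start_commit are illustrative stand-ins, not Hudi APIs):

```python
import time

def fake_start_commit():
    # Stand-in for client.startCommit(): returns a Hudi-style
    # yyyyMMddHHmmss instant, pinned to the epoch here for determinism.
    return time.strftime("%Y%m%d%H%M%S", time.gmtime(0))

def write_batch(records, start_commit):
    # Reserve the commit time FIRST, before the (possibly slow)
    # emptiness check, so the instant matches the job start time.
    commit_time = start_commit()
    if not list(records):          # stands in for hoodieRecords.isEmpty()
        return (True, None)        # nothing to write; reserved time unused
    return (False, commit_time)

empty = write_batch([], fake_start_commit)
nonempty = write_batch([{"id": 1}], fake_start_commit)
```

One design caveat the real fix has to handle: reserving an instant for a batch that turns out to be empty must not leave a dangling requested commit on the timeline.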
[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing
[ https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995153#comment-16995153 ] Yanjia Gary Li commented on HUDI-259: - Hello, I recently started using Hadoop 3 and Spark 2.4. [https://github.com/apache/incubator-hudi/commit/7bc08cbfdce337ad980bb544ec9fc3dbdf9c#diff-832156391e3edd5b0ceb86007ce6ae41] enabled me to compile Hudi with Hadoop 3, but some tests fail. > Hadoop 3 support for Hudi writing > - > > Key: HUDI-259 > URL: https://issues.apache.org/jira/browse/HUDI-259 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Usability >Reporter: Vinoth Chandar >Assignee: Pratyaksh Sharma >Priority: Major > > Sample issues > > [https://github.com/apache/incubator-hudi/issues/735] > [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] > [https://github.com/apache/incubator-hudi/issues/898] >
[jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
[ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-415: Status: Closed (was: Patch Available)
[jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
[ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-415: Status: Patch Available (was: In Progress)
[jira] [Commented] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
[ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001084#comment-17001084 ] Yanjia Gary Li commented on HUDI-415: - PR merged. Issue resolved.
[jira] [Created] (HUDI-610) Impala near real time table support
Yanjia Gary Li created HUDI-610: --- Summary: Impala near real time table support Key: HUDI-610 URL: https://issues.apache.org/jira/browse/HUDI-610 Project: Apache Hudi (incubating) Issue Type: New Feature Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li Impala uses the Java-based "frontend" module to list all the files to scan and lets the C++-based "backend" do all the file scanning. Merging Avro and Parquet could be difficult because it might need custom merging logic like RealtimeCompactedRecordReader to be implemented in the backend in C++, but I think it will be doable to have something like RealtimeUnmergedRecordReader, which only needs some changes in the frontend.
[jira] [Created] (HUDI-611) Impala sync tool
Yanjia Gary Li created HUDI-611: --- Summary: Impala sync tool Key: HUDI-611 URL: https://issues.apache.org/jira/browse/HUDI-611 Project: Apache Hudi (incubating) Issue Type: New Feature Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li Like the Hive sync tool, we need a tool to sync Hudi tables with Impala.
[jira] [Created] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer
Yanjia Gary Li created HUDI-644: --- Summary: Enable to retrieve checkpoint from previous commits in Delta Streamer Key: HUDI-644 URL: https://issues.apache.org/jira/browse/HUDI-644 Project: Apache Hudi (incubating) Issue Type: Improvement Components: DeltaStreamer Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li This ticket is to resolve the following problem: The user is using a homebrew Spark data source to read new data and write it to a Hudi table. The user would like to migrate to Delta Streamer. But the Delta Streamer only checks the last commit's metadata; if there is no checkpoint info, the Delta Streamer falls back to the default, which for the Kafka source is LATEST. The user would like to run the homebrew Spark data source reader and Delta Streamer in parallel to prevent data loss, but the Spark data source writer makes commits without checkpoint info, which resets the Delta Streamer. An option allowing the user to retrieve the checkpoint from previous commits, instead of only the latest commit, would be helpful for the migration.
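The behavior proposed here amounts to walking the commit timeline backwards: instead of inspecting only the newest commit, scan back until a commit carrying a checkpoint is found. A minimal Python sketch under assumed names (latest_checkpoint and the per-commit metadata-dict shape are illustrative, not the actual DeltaSync code):

```python
def latest_checkpoint(commits, key="deltastreamer.checkpoint.key"):
    # `commits` is a list of per-commit extra-metadata dicts, oldest first.
    # Scan newest-to-oldest and return the first checkpoint found; commits
    # written by a plain Spark datasource job carry no checkpoint entry.
    for meta in reversed(commits):
        if meta.get(key) is not None:
            return meta[key]
    return None  # no checkpoint anywhere: caller falls back to the source default

timeline = [
    {"deltastreamer.checkpoint.key": "topic,0:100"},  # old Delta Streamer commit
    {},  # datasource commit, no checkpoint info
    {},  # another datasource commit (the "latest")
]
```

With this ordering, the interleaved datasource commits no longer reset the checkpoint; the last Delta Streamer checkpoint ("topic,0:100" above) is still recovered.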
[jira] [Updated] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-644: Status: Open (was: New)
[jira] [Updated] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-644: Status: In Progress (was: Open)
[jira] [Updated] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-644: Fix Version/s: 0.6.0
[jira] [Closed] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
[ https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li closed HUDI-315. --- Resolution: Won't Fix > Reimplement statistics/workload profile collected during writes using Spark > 2.x custom accumulators > --- > > Key: HUDI-315 > URL: https://issues.apache.org/jira/browse/HUDI-315 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Performance, Writer Core >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > > https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1 > > In Hudi, there are two places where we need to obtain statistics on the input > data > - HoodieBloomIndex : for knowing what partitions need to be loaded and > checked against (is this still needed with the timeline server enabled is a > separate question) > - Workload profile to get a sense of number of updates, inserts to each > partition/file group > Both of them issue their own groupBy or shuffle computation today. This can > be avoided using an accumulator
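The gist of the ticket: both statistics passes issue their own groupBy/shuffle today, while the same per-partition counts could fall out of a single pass folded into an accumulator. A toy Python comparison (plain dicts stand in for Spark's groupBy and for an AccumulatorV2; both helper names are illustrative):

```python
from collections import Counter

def profile_with_shuffle(records):
    # Today's approach, sketched: group records by partition (a shuffle
    # in Spark), then count each materialized group.
    groups = {}
    for r in records:
        groups.setdefault(r["partition"], []).append(r)
    return {p: len(rs) for p, rs in groups.items()}

def profile_with_accumulator(records):
    # Proposed approach, sketched: fold counts into an accumulator in a
    # single pass over the data, with no grouping step at all.
    acc = Counter()
    for r in records:
        acc[r["partition"]] += 1
    return dict(acc)

workload = [
    {"partition": "2020/01"},
    {"partition": "2020/01"},
    {"partition": "2020/02"},
]
```

Both produce the same profile; the accumulator version simply avoids materializing the groups, which is the saving the ticket was after.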
[jira] [Commented] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
[ https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047137#comment-17047137 ] Yanjia Gary Li commented on HUDI-315: - Agree. Closing this ticket.
[jira] [Resolved] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
[ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-415. - Resolution: Fixed
[jira] [Reopened] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time
[ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reopened HUDI-415: -
[jira] [Commented] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
[ https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045986#comment-17045986 ] Yanjia Gary Li commented on HUDI-315: - I will take a look at this ticket
[jira] [Assigned] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
[ https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reassigned HUDI-315: --- Assignee: Yanjia Gary Li
[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions
[ https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-597: Description: For the use case that I only need to pull the incremental part of certain partitions, I currently have to do the incremental pull on the entire dataset first and then filter in Spark. If we could use the folder partitions directly as part of the input path, it could run faster by loading only the relevant parquet files. Example: {code:java} spark.read.format("org.apache.hudi") .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL) .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000") .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/year=2016/*/*/*") .load(path) {code} was: For the use case that I only need to pull the incremental part of certain partitions, I currently have to do the incremental pull on the entire dataset first and then filter in Spark. If we could use the folder partitions directly as part of the input path, it could run faster by loading only the relevant parquet files. Example: {code:java} spark.read.format("org.apache.hudi") .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL) .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000") .load(path, "year=2020/*/*/*") {code}
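The INCR_PATH_GLOB_OPT_KEY option described above amounts to glob-filtering the candidate files before any of them are loaded. A small Python sketch of that filtering step (filter_partitions is an illustrative helper, not the actual Hudi implementation; paths are taken relative to the table base path):

```python
from fnmatch import fnmatch

def filter_partitions(file_paths, path_glob):
    # Keep only files whose partition path matches the glob, so
    # irrelevant parquet files are skipped up front instead of being
    # loaded and filtered in Spark afterwards.
    pattern = path_glob.lstrip("/")
    return [p for p in file_paths if fnmatch(p, pattern)]

files = [
    "year=2016/month=01/day=01/a.parquet",
    "year=2016/month=02/day=03/b.parquet",
    "year=2017/month=01/day=01/c.parquet",
]
selected = filter_partitions(files, "/year=2016/*/*/*")  # keeps the two 2016 files
```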
[jira] [Resolved] (HUDI-597) Enable incremental pulling from defined partitions
[ https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-597. - Resolution: Fixed PR merged. Will update the DOC after 0.5.2 release
[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions
[ https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-597: Fix Version/s: 0.5.2
[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions
[ https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-597: Status: Open (was: New)
[jira] [Updated] (HUDI-611) Add Impala Guide to Doc
[ https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-611: Status: Open (was: New)
[jira] [Resolved] (HUDI-611) Add Impala Guide to Doc
[ https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-611. - Resolution: Fixed
[jira] [Updated] (HUDI-611) Add Impala Guide to Doc
[ https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-611: Status: In Progress (was: Open)
[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions
[ https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-597: Status: In Progress (was: Open) > Enable incremental pulling from defined partitions > -- > > Key: HUDI-597 > URL: https://issues.apache.org/jira/browse/HUDI-597 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > For use cases that only need to pull the incremental part of certain > partitions, I currently need to run the incremental pull on the entire dataset > and then filter in Spark. > If we could use the folder partitions directly as part of the input path, it > could run faster by loading only the relevant parquet files. > Example: > > {code:java} > spark.read.format("org.apache.hudi") > .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL) > .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000") > .load(path, "year=2020/*/*/*") > > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-597) Enable incremental pulling from defined partitions
Yanjia Gary Li created HUDI-597: --- Summary: Enable incremental pulling from defined partitions Key: HUDI-597 URL: https://issues.apache.org/jira/browse/HUDI-597 Project: Apache Hudi (incubating) Issue Type: New Feature Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li For use cases that only need to pull the incremental part of certain partitions, I currently need to run the incremental pull on the entire dataset and then filter in Spark. If we could use the folder partitions directly as part of the input path, it could run faster by loading only the relevant parquet files. Example:
{code:java}
spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .load(path, "year=2020/*/*/*")
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
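The gain the ticket describes comes from matching the partition directories of each file path against the glob before any data is read, so non-matching partitions are never loaded. A minimal sketch of that filtering step in Python (the helper name and paths are hypothetical illustrations, not Hudi code):

```python
from fnmatch import fnmatch

def filter_partition_paths(paths, base, pattern):
    """Keep only files whose partition directories (relative to `base`) match the glob."""
    out = []
    for p in paths:
        rel = p[len(base):].lstrip("/")
        # Match against the partition directories only, not the file name itself
        rel_dirs = "/".join(rel.split("/")[:-1])
        if fnmatch(rel_dirs, pattern):
            out.append(p)
    return out

paths = [
    "/data/hudi/year=2020/month=01/day=01/hour=00/f1.parquet",
    "/data/hudi/year=2019/month=12/day=31/hour=23/f2.parquet",
]
print(filter_partition_paths(paths, "/data/hudi", "year=2020/*/*/*"))
```

With the pattern `year=2020/*/*/*`, only the first file survives; the 2019 partition is pruned without ever being opened.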
[jira] [Resolved] (HUDI-146) Impala Support
[ https://issues.apache.org/jira/browse/HUDI-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-146. - Resolution: Done Read-optimized tables are now supported by Impala. Fixed by: [https://github.com/apache/impala/commit/ea0e1def6160d596082b01365fcbbb6e24afb21d] Sample query to create a Hudi table: [https://github.com/apache/impala/blob/ea0e1def6160d596082b01365fcbbb6e24afb21d/testdata/datasets/functional/functional_schema_template.sql#L2758] > Impala Support > -- > > Key: HUDI-146 > URL: https://issues.apache.org/jira/browse/HUDI-146 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Hive Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > > [https://github.com/apache/incubator-hudi/issues/179] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-611) Add Impala Guide to Doc
[ https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-611: Summary: Add Impala Guide to Doc (was: Impala sync tool) > Add Impala Guide to Doc > --- > > Key: HUDI-611 > URL: https://issues.apache.org/jira/browse/HUDI-611 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > > Similar to the Hive sync, we need a tool to sync with Impala. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-611) Add Impala Guide to Doc
[ https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-611: Priority: Minor (was: Major) > Add Impala Guide to Doc > --- > > Key: HUDI-611 > URL: https://issues.apache.org/jira/browse/HUDI-611 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Similar to the Hive sync, we need a tool to sync with Impala. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Description: I am using a manually built master after the [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] commit. EDIT: tried with the latest master and got the same result. I am seeing 3 million tasks when the Hudi Spark job writes files into HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 million tasks, with 9 GB it was 3.7 million, both with parallelism set to 10. I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in {code:java} count at HoodieSparkSqlWriter{code} All the stages before this seem normal. Any idea what happened here? My first guess would be something related to the bloom filter index. Maybe something triggers repartitioning with the bloom filter index? But I am not really familiar with that part of the code. Thanks was: I am using the manual build master after [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] commit. I am seeing 3 million tasks when the Hudi Spark job writing the files into HDFS. I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task only writes less than 10 records in {code:java} count at HoodieSparkSqlWriter{code} All the stages before this seems normal. Any idea what happened here? 
> [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png > > > I am using a manually built master after the > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master and got the same result. > I am seeing 3 million tasks when the Hudi Spark job writes files into > HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 > million tasks, with 9 GB it was 3.7 million, both with parallelism set to 10. > I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task writes fewer than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > something triggers repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Attachment: Screen Shot 2020-01-02 at 8.53.44 PM.png > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png > > > I am using a manually built master after the > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. > I am seeing 3 million tasks when the Hudi Spark job writes files into > HDFS. > I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task writes fewer than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Attachment: Screen Shot 2020-01-02 at 8.53.24 PM.png > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png > > > I am using a manually built master after the > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. > I am seeing 3 million tasks when the Hudi Spark job writes files into > HDFS. > I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task writes fewer than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
Yanjia Gary Li created HUDI-494: --- Summary: [DEBUGGING] Huge amount of tasks when writing files into HDFS Key: HUDI-494 URL: https://issues.apache.org/jira/browse/HUDI-494 Project: Apache Hudi (incubating) Issue Type: Test Reporter: Yanjia Gary Li Assignee: Vinoth Chandar Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 2020-01-02 at 8.53.44 PM.png I am using a manually built master after the [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] commit. I am seeing 3 million tasks when the Hudi Spark job writes files into HDFS. I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in {code:java} count at HoodieSparkSqlWriter{code} All the stages before this seem normal. Any idea what happened here? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008199#comment-17008199 ] Yanjia Gary Li commented on HUDI-494: - Hello [~lamber-ken], Thanks for trying this out. This behavior is very strange and I haven't seen it happen before with older versions of Hudi (0.4.7). I recently upgraded my cluster to Hadoop 3 and Spark 2.4 with the latest Hudi snapshot. More details about my scenario: * My dataset was partitioned by year/month/day, and the total number of parquet files is a few thousand. The total size of the dataset was a few TBs. * When the upsert job was running (halfway through the 3 million tasks), there was only one partition under .hoodie/.temp/20200102/year=2020/month=01/day=01/, but in that partition there were tons of parquet.marker files. * I also checked the delta input; it should be under the same partition. * I used hoodie.index.bloom.num_entries = 2,000,000 based on the number of records in each parquet file. * Max parquet size was set to 128MB and min was 100MB. My guess of the cause: * In the initial bulkInsert, I set the bulkInsertParallelism too high, which caused the average size of the parquet files to be about 30MB, below the min value I set. But I guess this might not be related. I am rerunning the initial bulkInsert job with lower parallelism and will then run an upsert job to see what happens. * 3 million tasks looks like some sort of overflow. I need to dig into the code for this. 
> [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
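As a sanity check on the "related to the input size" observation above, the two reported data points work out to a near-constant amount of input per task, roughly 2.5 KiB, which is far too little work per task for a healthy Spark write and is consistent with a pathological task split. A quick back-of-the-envelope calculation (numbers taken directly from the ticket; GB is assumed to mean GiB):

```python
# (input size in bytes, observed task count), from the two observations in the ticket
cases = [(7.7 * 1024**3, 3_200_000), (9.0 * 1024**3, 3_700_000)]
for size_bytes, tasks in cases:
    kib_per_task = size_bytes / tasks / 1024
    print(f"{kib_per_task:.2f} KiB of input per task")
```

Both cases land around 2.5 KiB per task, supporting the input-size-proportional hypothesis rather than a fixed-count misconfiguration.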
[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010031#comment-17010031 ] Yanjia Gary Li commented on HUDI-494: - [~vinoth] Thanks for the feedback. The code snippets were prepared by [~lamber-ken], and my dataset that has this issue was partitioned by year/month/day/hour. The behavior I observed was that the path */.hoodie/.temp/20200101/year=2020/month=1/day=1/hour=00* had a ton of files. For my dataset, I calculate the parallelism based on the input data size. I set *bulkInsertParallelism = inputSizeInMB / 100* which was 6 for my 6TB dataset. The *upsertParallelism = 10* was based on the input size when I ran this upsert job. I will reproduce this once I get the chance and provide more details. > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, image-2020-01-05-07-30-53-567.png > > > I am using a manually built master after the > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master and got the same result. > I am seeing 3 million tasks when the Hudi Spark job writes files into > HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 > million tasks, with 9 GB it was 3.7 million, both with parallelism set to 10. > I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task writes fewer than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > something triggers repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-644) checkpoint generator tool for delta streamer
[ https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-644: Summary: checkpoint generator tool for delta streamer (was: Enable to retrieve checkpoint from previous commits in Delta Streamer) > checkpoint generator tool for delta streamer > > > Key: HUDI-644 > URL: https://issues.apache.org/jira/browse/HUDI-644 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Labels: pull-request-available > Fix For: 0.6.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This ticket is to resolve the following problem: > The user is using a homebrew Spark data source to read new data and write to a > Hudi table. > The user would like to migrate to Delta Streamer. > But the Delta Streamer only checks the last commit's metadata; if there is no > checkpoint info, the Delta Streamer will use the default, which for the Kafka > source is LATEST. > The user would like to run the homebrew Spark data source reader and Delta > Streamer in parallel to prevent data loss, but the Spark data source writer > will make commits without checkpoint info, which will reset the Delta > Streamer. > An option that allows the user to retrieve the checkpoint from > previous commits instead of only the latest commit would be helpful for the > migration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
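The fix the ticket describes amounts to scanning the commit timeline newest-first and taking the first checkpoint found, instead of giving up when the very last commit (e.g. one written by the homebrew Spark writer) carries none. A minimal sketch of that scan in Python (the commit representation and metadata key are simplified assumptions, not Hudi's actual API):

```python
def latest_checkpoint(commits):
    """Scan commits newest-first and return the most recent checkpoint, if any.

    `commits` is a list of (instant_time, metadata_dict) ordered oldest-first;
    a commit written outside Delta Streamer simply lacks the checkpoint key.
    """
    for _, meta in reversed(commits):
        if "deltastreamer.checkpoint.key" in meta:
            return meta["deltastreamer.checkpoint.key"]
    return None  # caller falls back to the source default (e.g. LATEST for Kafka)

commits = [
    ("20200101000000", {"deltastreamer.checkpoint.key": "topic,0:100"}),
    ("20200102000000", {}),  # commit from the homebrew Spark writer, no checkpoint
]
print(latest_checkpoint(commits))
```

Here the older checkpoint survives the checkpoint-less commit on top of it, which is exactly the behavior the migration scenario needs.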
[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-773: Summary: Hudi On Azure Data Lake Storage V2 (was: Hudi On Azure Data Lake Storage) > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-773) Hudi On Azure Data Lake Storage
Yanjia Gary Li created HUDI-773: --- Summary: Hudi On Azure Data Lake Storage Key: HUDI-773 URL: https://issues.apache.org/jira/browse/HUDI-773 Project: Apache Hudi (incubating) Issue Type: New Feature Components: Usability Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-759) Integrate checkpoint provider
[ https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-759. - Resolution: Fixed > Integrate checkpoint provider > - > > Key: HUDI-759 > URL: https://issues.apache.org/jira/browse/HUDI-759 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Labels: pull-request-available > Fix For: 0.6.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082773#comment-17082773 ] Yanjia Gary Li edited comment on HUDI-69 at 4/14/20, 10:11 PM: --- After a closer look, I think Spark datasource support for the realtime table needs: * Support for the hadoop.mapreduce.xxx APIs. We use hadoop.mapred.RecordReader, but Spark SQL uses hadoop.mapreduce.RecordReader. We need to figure out how to support both APIs, or upgrade to mapreduce. * An extension of Spark's ParquetInputFormat or a custom data source reader to handle the merge. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala] * Making Datasource V2 the default data source. Please let me know what you guys think. was (Author: garyli1019): After a closer look, I think Spark datasource support for realtime table needs: * Refactoring HoodieRealtimeFormat and (file split, record reader). Decouple Hudi logic from the MapredParquetInputFormat. I think we can maintain the Hudi file split and path filtering in a central place, and able to be adopted by different query engines. With bootstrap support, the file format maintenance could be more complicated. I think this is very essential. * Implement the extension of ParquetInputFormat from Spark or a custom data source reader to handle merge. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala] * Use Datasource V2 to be the default data source. Please let me know what you guys think. 
> Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > https://github.com/uber/hudi/issues/136 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reassigned HUDI-765: --- Assignee: Yanjia Gary Li > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi (incubating) > Issue Type: Sub-task >Reporter: lamber-ken >Assignee: Yanjia Gary Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-791) Replace null by Option in Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085942#comment-17085942 ] Yanjia Gary Li commented on HUDI-791: - [~tison] Thanks for looking into this ticket! The intent here is to make the code cleaner and more robust. If you are interested in improving the Delta Streamer, please feel free to claim this ticket :) > Replace null by Option in Delta Streamer > > > Key: HUDI-791 > URL: https://issues.apache.org/jira/browse/HUDI-791 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: DeltaStreamer, newbie >Reporter: Yanjia Gary Li >Priority: Minor > > There are a lot of nulls in the Delta Streamer. It would be great if we could > replace them with Option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086087#comment-17086087 ] Yanjia Gary Li commented on HUDI-773: - Hello [~sasikumar.venkat], I am very new to Azure. How is your cluster set up? Are you using HDInsight or Databricks? Is your Spark cluster attached to the storage account, or does it access it through an API? > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-805) Verify which types of Azure storage support Hudi
Yanjia Gary Li created HUDI-805: --- Summary: Verify which types of Azure storage support Hudi Key: HUDI-805 URL: https://issues.apache.org/jira/browse/HUDI-805 Project: Apache Hudi (incubating) Issue Type: Sub-task Reporter: Yanjia Gary Li Azure has the following storage options: Azure Data Lake Storage Gen 1, Azure Data Lake Storage Gen 2, and Azure Blob Storage (legacy name: Windows Azure Storage Blob). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-804) Add Azure Support to Hudi Doc
Yanjia Gary Li created HUDI-804: --- Summary: Add Azure Support to Hudi Doc Key: HUDI-804 URL: https://issues.apache.org/jira/browse/HUDI-804 Project: Apache Hudi (incubating) Issue Type: Sub-task Reporter: Yanjia Gary Li -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085409#comment-17085409 ] Yanjia Gary Li commented on HUDI-773: - Hello [~sasikumar.venkat], thanks for sharing! I am able to write Hudi data without OAUTH. We are probably among the first few people in the community using Hudi on Azure, so I believe we need to figure this out :) I will try to reproduce your issue and will update here once I have. > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082773#comment-17082773 ] Yanjia Gary Li commented on HUDI-69: After a closer look, I think Spark datasource support for the realtime table needs: * Refactoring HoodieRealtimeFormat and its (file split, record reader) pair, decoupling the Hudi logic from MapredParquetInputFormat. I think we can maintain the Hudi file split and path filtering in a central place so they can be adopted by different query engines. With bootstrap support, the file format maintenance could become more complicated, so I think this is essential. * Implementing an extension of Spark's ParquetInputFormat or a custom data source reader to handle the merge. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala] * Making Datasource V2 the default data source. Please let me know what you guys think. > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > https://github.com/uber/hudi/issues/136 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-791) Replace null by Option in Delta Streamer
Yanjia Gary Li created HUDI-791: --- Summary: Replace null by Option in Delta Streamer Key: HUDI-791 URL: https://issues.apache.org/jira/browse/HUDI-791 Project: Apache Hudi (incubating) Issue Type: New Feature Components: DeltaStreamer, newbie Reporter: Yanjia Gary Li There are a lot of nulls in the Delta Streamer. It would be great if we could replace them with Option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-791) Replace null by Option in Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-791: Issue Type: Improvement (was: New Feature) > Replace null by Option in Delta Streamer > > > Key: HUDI-791 > URL: https://issues.apache.org/jira/browse/HUDI-791 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: DeltaStreamer, newbie >Reporter: Yanjia Gary Li >Priority: Minor > > There are a lot of nulls in the Delta Streamer. It would be great if we could > replace them with Option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-30) Explore support for Spark Datasource V2
[ https://issues.apache.org/jira/browse/HUDI-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reassigned HUDI-30: -- Assignee: Yanjia Gary Li > Explore support for Spark Datasource V2 > --- > > Key: HUDI-30 > URL: https://issues.apache.org/jira/browse/HUDI-30 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > > https://github.com/uber/hudi/issues/501 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-30) Explore support for Spark Datasource V2
[ https://issues.apache.org/jira/browse/HUDI-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-30: --- Status: In Progress (was: Open) > Explore support for Spark Datasource V2 > --- > > Key: HUDI-30 > URL: https://issues.apache.org/jira/browse/HUDI-30 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > > https://github.com/uber/hudi/issues/501 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087994#comment-17087994 ] Yanjia Gary Li commented on HUDI-773: - [~sasikumar.venkat] I haven't tried Databricks Spark myself, but one of my colleagues tried it and had some issues with Hudi writes, probably related to yours. As Vinoth mentioned, any debugging info would be helpful. I will also try it myself later. > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats
Yanjia Gary Li created HUDI-822: --- Summary: Decouple hoodie related methods with Hoodie Input Formats Key: HUDI-822 URL: https://issues.apache.org/jira/browse/HUDI-822 Project: Apache Hudi (incubating) Issue Type: Sub-task Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li In order to support multiple query engines, we need to generalize the Hudi input format and the Hudi record merging logic, and decouple them from MapredParquetInputFormat, which depends on Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005)
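The engine-independent core of the record-merging logic mentioned above is conceptually simple: for each record key, the delta (log) record with the latest precombine value wins over the base-file record, and keys seen only in the log become inserts. A minimal sketch in Python (field names follow the examples used elsewhere in this thread; Hudi's real payload classes are far more involved):

```python
def merge_records(base, log, key_field="_uuid", precombine="TIMESTAMP"):
    """Merge log records over base records: the latest value per key wins."""
    merged = {r[key_field]: r for r in base}
    for r in log:
        k = r[key_field]
        # Newer (or equal) precombine value replaces the base record; unseen keys insert
        if k not in merged or r[precombine] >= merged[k][precombine]:
            merged[k] = r
    return list(merged.values())

base = [{"_uuid": "0", "TIMESTAMP": "201901", "v": "old"}]
log = [{"_uuid": "0", "TIMESTAMP": "201910", "v": "new"},
       {"_uuid": "1", "TIMESTAMP": "201910", "v": "insert"}]
print(merge_records(base, log))
```

Keeping a pure function like this separate from any InputFormat is exactly the kind of decoupling the ticket argues for: the same merge can then back Hive, Spark, or Impala readers.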
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081030#comment-17081030 ] Yanjia Gary Li commented on HUDI-773: - Surprisingly easy... I tried the following test using a Spark 2.4 HDInsight cluster with Azure Data Lake Storage V2. Hudi ran out of the box; no extra config needed.
{code:java}
// Initial batch
val outputPath = "/Test/HudiWrite"
val df1 = Seq(
  ("0", "year=2019", "test1", "pass", "201901"),
  ("1", "year=2019", "test1", "pass", "201901"),
  ("2", "year=2020", "test1", "pass", "201901"),
  ("3", "year=2020", "test1", "pass", "201901")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")

val bulk_insert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
df1.write.format("org.apache.hudi").options(bulk_insert_ops).mode(SaveMode.Overwrite).save(outputPath)

// Upsert
val upsert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
val df2 = Seq(
  ("0", "year=2019", "test1", "pass", "201910"),
  ("1", "year=2019", "test1", "pass", "201910"),
  ("2", "year=2020", "test1", "pass", "201910"),
  ("3", "year=2020", "test1", "pass", "201910")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
df2.write.format("org.apache.hudi").options(upsert_ops).mode(SaveMode.Append).save(outputPath)

// Read as hudi format
val df_read = spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).load(outputPath)
assert(df_read.count() == 4)
{code}
> Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081032#comment-17081032 ] Yanjia Gary Li commented on HUDI-773: - Are any extra tests needed? What tests have you done for AWS and GCP? [~vinoth] [~vbalaji] > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-773: Status: In Progress (was: Open) > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-773: Fix Version/s: 0.6.0 > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072382#comment-17072382 ] Yanjia Gary Li commented on HUDI-69: [~vinoth] I am happy to work on this ticket. Please assign it to me. > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 0.6.0 > > > https://github.com/uber/hudi/issues/136 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-759) Integrate checkpoint provider
Yanjia Gary Li created HUDI-759: --- Summary: Integrate checkpoint provider Key: HUDI-759 URL: https://issues.apache.org/jira/browse/HUDI-759 Project: Apache Hudi (incubating) Issue Type: New Feature Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li Fix For: 0.6.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-759) Integrate checkpoint provider
[ https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-759: Status: Open (was: New) > Integrate checkpoint provider > - > > Key: HUDI-759 > URL: https://issues.apache.org/jira/browse/HUDI-759 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-759) Integrate checkpoint provider
[ https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-759: Status: In Progress (was: Open) > Integrate checkpoint provider > - > > Key: HUDI-759 > URL: https://issues.apache.org/jira/browse/HUDI-759 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-644) checkpoint generator tool for delta streamer
[ https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-644. - Resolution: Fixed > checkpoint generator tool for delta streamer > > > Key: HUDI-644 > URL: https://issues.apache.org/jira/browse/HUDI-644 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Labels: pull-request-available > Fix For: 0.6.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This ticket is to resolve the following problem: > The user has finished the initial load and write to Hudi table > The user would like to migrate to Delta Streamer > The user needs a tool to provide the checkpoint for the Delta Streamer in the > first run. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Status: In Progress (was: Open) > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > https://github.com/uber/hudi/issues/136 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076023#comment-17076023 ] Yanjia Gary Li commented on HUDI-69: Hello [~bhasudha], I found that your commit [https://github.com/apache/incubator-hudi/commit/d09eacdc13b9f19f69a317c8d08bda69a43678bc] could be related to this ticket. Is InputPathHandler able to provide MOR snapshot paths (avro + parquet)? If not, I could probably start from the path selector. To add Spark datasource support for RealtimeUnmergedRecordReader, we may simply use the Spark SQL API to read the two formats separately and then union them together. Does that make sense? To merge them, I might need to dig deeper. > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > https://github.com/uber/hudi/issues/136 -- This message was sent by Atlassian Jira (v8.3.4#803005)
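The union-then-merge idea in the comment above can be sketched with plain collections. This is only a toy model of the semantics, not Hudi code: the real reader works on Parquet base files and Avro log blocks, and the `Rec` type, field names, and precombine-by-timestamp rule here are all assumptions for illustration. The unmerged view simply concatenates the two record sets; a merged view would keep, per record key, the record with the latest precombine value.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RealtimeViewSketch {
    // Toy record: key -> (precombine timestamp, payload). Names are hypothetical.
    record Rec(String key, long ts, String payload) {}

    // Unmerged view: concatenate base-file records and log records, keeping duplicates.
    static List<Rec> unmerged(List<Rec> base, List<Rec> log) {
        List<Rec> out = new ArrayList<>(base);
        out.addAll(log);
        return out;
    }

    // Merged view: per key, the record with the higher precombine value wins.
    static Collection<Rec> merged(List<Rec> base, List<Rec> log) {
        Map<String, Rec> byKey = new HashMap<>();
        for (Rec r : base) byKey.put(r.key(), r);
        for (Rec r : log) byKey.merge(r.key(), r, (o, n) -> n.ts() >= o.ts() ? n : o);
        return byKey.values();
    }

    public static void main(String[] args) {
        List<Rec> base = List.of(new Rec("0", 201901, "pass"), new Rec("1", 201901, "pass"));
        List<Rec> log  = List.of(new Rec("0", 201910, "fail"));
        System.out.println(unmerged(base, log).size()); // 3: duplicates kept
        System.out.println(merged(base, log).size());   // 2: key "0" deduplicated, log wins
    }
}
```

The unmerged case maps directly onto a Spark SQL union of two reads; the merged case is where the deeper digging mentioned above comes in.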
[jira] [Updated] (HUDI-644) checkpoint generator tool for delta streamer
[ https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-644: Description: This ticket is to resolve the following problem: The user has finished the initial load and write to Hudi table The user would like to migrate to Delta Streamer The user needs a tool to provide the checkpoint for the Delta Streamer in the first run. was: This ticket is to resolve the following problem: The user is using a homebrew Spark data source to read new data and write to Hudi table The user would like to migrate to Delta Streamer But the Delta Streamer only checks the last commit metadata, if there is no checkpoint info, then the Delta Streamer will use the default. For Kafka source, it is LATEST. The user would like to run the homebrew Spark data source reader and Delta Streamer in parallel to prevent data loss, but the Spark data source writer will make commit without checkpoint info, which will reset the delta streamer. So if we have an option to allow the user to retrieve the checkpoint from previous commits instead of the latest commit would be helpful for the migration. > checkpoint generator tool for delta streamer > > > Key: HUDI-644 > URL: https://issues.apache.org/jira/browse/HUDI-644 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Labels: pull-request-available > Fix For: 0.6.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This ticket is to resolve the following problem: > The user has finished the initial load and write to Hudi table > The user would like to migrate to Delta Streamer > The user needs a tool to provide the checkpoint for the Delta Streamer in the > first run. -- This message was sent by Atlassian Jira (v8.3.4#803005)
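The retrieval logic described in the old HUDI-644 description above (when the latest commit carries no checkpoint info, look back through earlier commits instead of resetting to the default) can be sketched as follows. This is a toy model over in-memory commit metadata, not the actual tool or its API; the method name and the use of a list of maps are assumptions, though `deltastreamer.checkpoint.key` is the metadata key the description alludes to.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class CheckpointLookupSketch {
    // Each commit's extra metadata modeled as a key/value map
    // (in a real table this lives in the .commit files under .hoodie).
    static Optional<String> latestCheckpoint(List<Map<String, String>> commitsOldestFirst) {
        // Walk commits newest-first and return the first checkpoint found, so a
        // commit written without checkpoint info (e.g. by a plain Spark datasource
        // writer running in parallel) does not reset the delta streamer.
        for (int i = commitsOldestFirst.size() - 1; i >= 0; i--) {
            String cp = commitsOldestFirst.get(i).get("deltastreamer.checkpoint.key");
            if (cp != null) return Optional.of(cp);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Map<String, String>> commits = List.of(
            Map.of("deltastreamer.checkpoint.key", "topic,0:100"), // delta streamer commit
            Map.of()                                               // datasource commit, no checkpoint
        );
        System.out.println(latestCheckpoint(commits)); // Optional[topic,0:100]
    }
}
```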
[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Description: [https://github.com/uber/hudi/issues/136] RFC: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1] was: [https://github.com/uber/hudi/issues/136] RFC: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091042#comment-17091042 ] Yanjia Gary Li commented on HUDI-773: - Hello [~sasikumar.venkat], could you try the following: mount your storage account to Databricks
{code:java}
dbutils.fs.mount(
  source = "abfss://x...@xxx.dfs.core.windows.net",
  mountPoint = "/mountpoint",
  extraConfigs = configs)
{code}
When writing to Hudi, use the abfss URL
{code:java}
save("abfss://<>.dfs.core.windows.net/hudi-tables/customer")
{code}
When reading Hudi data, use the mount point
{code:java}
load("/mountpoint/hudi-tables/customer")
{code}
I believe this error could be related to the Databricks internal setup. > Hudi On Azure Data Lake Storage V2 > -- > > Key: HUDI-773 > URL: https://issues.apache.org/jira/browse/HUDI-773 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Usability >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Description: [https://github.com/uber/hudi/issues/136] RFC: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] was:https://github.com/uber/hudi/issues/136 > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reopened HUDI-69: > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > PR: [https://github.com/apache/incubator-hudi/pull/1592] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Comment: was deleted (was: Can anyone reopen this ticket? I accidentally closed this :)) > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > PR: [https://github.com/apache/incubator-hudi/pull/1592] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Status: Closed (was: Patch Available) > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > PR: [https://github.com/apache/incubator-hudi/pull/1592] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099507#comment-17099507 ] Yanjia Gary Li commented on HUDI-69: Can anyone reopen this ticket? I accidentally closed this :) > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > PR: [https://github.com/apache/incubator-hudi/pull/1592] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats
[ https://issues.apache.org/jira/browse/HUDI-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-822: Status: In Progress (was: Open) > Decouple hoodie related methods with Hoodie Input Formats > - > > Key: HUDI-822 > URL: https://issues.apache.org/jira/browse/HUDI-822 > Project: Apache Hudi (incubating) > Issue Type: Sub-task >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > > In order to support multiple query engines, we need to generalize the Hudi > input format and Hudi record merging logic. And decouple from > MapredParquetInputFormat, which is depending on Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Description: [https://github.com/uber/hudi/issues/136] RFC: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] PR: [https://github.com/apache/incubator-hudi/pull/1592] was: [https://github.com/uber/hudi/issues/136] RFC: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1] > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > PR: [https://github.com/apache/incubator-hudi/pull/1592] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136
[ https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-69: --- Status: Patch Available (was: In Progress) > Support realtime view in Spark datasource #136 > -- > > Key: HUDI-69 > URL: https://issues.apache.org/jira/browse/HUDI-69 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Vinoth Chandar >Assignee: Yanjia Gary Li >Priority: Major > Fix For: 0.6.0 > > > [https://github.com/uber/hudi/issues/136] > RFC: > [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader] > WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Fix Version/s: 0.5.3 > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Labels: bug-bash-0.6.0, pull-request-available > Fix For: 0.5.3 > > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty
[ https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-528: Fix Version/s: 0.5.3 > Incremental Pull fails when latest commit is empty > -- > > Key: HUDI-528 > URL: https://issues.apache.org/jira/browse/HUDI-528 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Incremental Pull >Reporter: Javier Vega >Assignee: Yanjia Gary Li >Priority: Minor > Labels: bug-bash-0.6.0, help-requested, pull-request-available > Fix For: 0.5.3 > > > When trying to create an incremental view of a dataset, an exception is > thrown when the latest commit in the time range is empty. In order to > determine the schema of the dataset, Hudi will grab the [latest commit file, > parse it, and grab the first metadata file > path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80]. > If the latest commit was empty though, the field which is used to determine > file paths (partitionToWriteStats) will be empty causing the following > exception: > > > {code:java} > java.util.NoSuchElementException > at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447) > at java.util.HashMap$ValueIterator.next(HashMap.java:1474) > at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
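The failure mode quoted in HUDI-528 above is easy to reproduce in isolation: calling next() on the value iterator of an empty HashMap — which is what happens when partitionToWriteStats of an empty commit has no entries — throws exactly the NoSuchElementException in the stack trace. A minimal demonstration in plain Java (not Hudi code; the map contents are placeholders):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class EmptyCommitRepro {
    public static void main(String[] args) {
        // Models the parsed metadata of an empty commit: partitionToWriteStats has
        // no entries, so there is no file path from which to derive a schema.
        Map<String, Object> partitionToWriteStats = new HashMap<>();
        try {
            // IncrementalRelation dereferenced the first write stat unconditionally:
            Object firstStat = partitionToWriteStats.values().iterator().next();
            System.out.println(firstStat);
        } catch (NoSuchElementException e) {
            // A fix needs to check for emptiness (or fall back to an earlier
            // non-empty commit) before calling next() on the iterator.
            System.out.println("empty commit: no write stats to derive schema from");
        }
    }
}
```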
[jira] [Assigned] (HUDI-318) Update Migration Guide to Include Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reassigned HUDI-318: --- Assignee: (was: Yanjia Gary Li) > Update Migration Guide to Include Delta Streamer > > > Key: HUDI-318 > URL: https://issues.apache.org/jira/browse/HUDI-318 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: Docs >Reporter: Yanjia Gary Li >Priority: Minor > Labels: doc > > [http://hudi.apache.org/migration_guide.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-110) Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-110: Status: In Progress (was: Open) > Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer > -- > > Key: HUDI-110 > URL: https://issues.apache.org/jira/browse/HUDI-110 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: DeltaStreamer, Spark Integration, Usability >Reporter: Balaji Varadarajan >Assignee: Yanjia Gary Li >Priority: Minor > Labels: bug-bash-0.6.0 > > Currently > SlashEncodedDayPartitionValueExtractor is the default being used. This is not > a common format outside Uber. > > Also, Spark DataSource provides partitionedBy clauses which has not been > integrated for Hudi Data Source. We need to investigate how we can leverage > partitionBy clause for partitioning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-890) Prepare for 0.5.3 patch release
[ https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109805#comment-17109805 ] Yanjia Gary Li commented on HUDI-890: - Hi [~bhavanisudha] , #1602 HUDI-494 fix incorrect record size estimation was pushed to 0.6.0. Thanks > Prepare for 0.5.3 patch release > --- > > Key: HUDI-890 > URL: https://issues.apache.org/jira/browse/HUDI-890 > Project: Apache Hudi (incubating) > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Major > Fix For: 0.5.3 > > > The following commits are included in this release. > * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break > the inheritance chain > * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient > * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in > NumericUtils.java > * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when > filterDupes is enabled on UPSERT mode. > * #1517 HUDI-799 Use appropriate FS when loading configs > * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema > * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing > the DataFrame > * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode > * #1421 HUDI-724 Parallelize getSmallFiles for partitions > * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by > Date type columns > * #1413 Add constructor to HoodieROTablePathFilter > * #1415 HUDI-539 Make ROPathFilter conf member serializable > * #1578 Add changes for presto mor queries > * #1506 HUDI-782 Add support of Aliyun object storage service. > * #1432 HUDI-716 Exception: Not an Avro data file when running > HoodieCleanClient.runClean > * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction > * #1448 [MINOR] Update DOAP with 0.5.2 Release > * #1466 HUDI-742 Fix Java Math Exception > * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements. 
> * #1427 HUDI-727: Copy default values of fields if not present when > rewriting incoming record with new schema > * #1515 HUDI-795 Handle auto-deleted empty aux folder > * #1547 [MINOR]: Fix cli docs for DeltaStreamer > * #1580 HUDI-852 adding check for table name for Append Save mode > * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in > HoodieGlobalBloomIndex class > * #1434 HUDI-616 Fixed parquet files getting created on local FS > * #1633 HUDI-858 Allow multiple operations to be executed within a single > commit > * #1634 HUDI-846Enable Incremental cleaning and embedded timeline-server by > default > * #1596 HUDI-863 get decimal properties from derived spark DataType > * #1602 HUDI-494 fix incorrect record size estimation > * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using > timeline server > * #1584 HUDI-902 Avoid exception when getSchemaProvider > * #1612 HUDI-528 Handle empty commit in incremental pulling > * #1511 HUDI-789Adjust logic of upsert in HDFSParquetImporter > * #1627 HUDI-889 Writer supports useJdbc configuration when hive > synchronization is enabled -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Fix Version/s: (was: 0.5.3) > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: lamber-ken >Priority: Major > Labels: bug-bash-0.6.0, pull-request-available > Fix For: 0.6.0 > > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-905) Support native filter pushdown for Spark Datasource
Yanjia Gary Li created HUDI-905: --- Summary: Support native filter pushdown for Spark Datasource Key: HUDI-905 URL: https://issues.apache.org/jira/browse/HUDI-905 Project: Apache Hudi (incubating) Issue Type: New Feature Reporter: Yanjia Gary Li Assignee: Yanjia Gary Li -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101207#comment-17101207 ] Yanjia Gary Li commented on HUDI-494: - Commit 1:
{code:java}
"partitionToWriteStats" : {
  "year=2020/month=5/day=0/hour=0" : [ {
    "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
    "path" : "year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-112-1773_20200504101048.parquet",
    "prevCommit" : "null",
    "numWrites" : 21,
    "numDeletes" : 0,
    "numUpdateWrites" : 0,
    "numInserts" : 21,
    "totalWriteBytes" : 14397559,
    "totalWriteErrors" : 0,
    "tempPath" : null,
    "partitionPath" : "year=2020/month=5/day=0/hour=0",
    "totalLogRecords" : 0,
    "totalLogFilesCompacted" : 0,
    "totalLogSizeCompacted" : 0,
    "totalUpdatedRecordsCompacted" : 0,
    "totalLogBlocks" : 0,
    "totalCorruptLogBlock" : 0,
    "totalRollbackBlocks" : 0,
    "fileSizeInBytes" : 14397559
  }
{code}
Commit 2:
{code:java}
"partitionToWriteStats" : {
  "year=2020/month=5/day=0/hour=0" : [ {
    "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
    "path" : "year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-248-163129_20200505023830.parquet",
    "prevCommit" : "20200504101048",
    "numWrites" : 12817,
    "numDeletes" : 0,
    "numUpdateWrites" : 0,
    "numInserts" : 12796,
    "totalWriteBytes" : 16297335,
    "totalWriteErrors" : 0,
    "tempPath" : null,
    "partitionPath" : "year=2020/month=5/day=0/hour=0",
    "totalLogRecords" : 0,
    "totalLogFilesCompacted" : 0,
    "totalLogSizeCompacted" : 0,
    "totalUpdatedRecordsCompacted" : 0,
    "totalLogBlocks" : 0,
    "totalCorruptLogBlock" : 0,
    "totalRollbackBlocks" : 0,
    "fileSizeInBytes" : 16297335
  }, {
    "fileId" : "9d0c9e79-00dd-41d2-a217-0944f8428e1c-0",
    "path" : "year=2020/month=5/day=0/hour=0/9d0c9e79-00dd-41d2-a217-0944f8428e1c-0_1-248-163130_20200505023830.parquet",
    "prevCommit" : "null",
    "numWrites" : 200,
    "numDeletes" : 0,
    "numUpdateWrites" : 0,
    "numInserts" : 200,
    "totalWriteBytes" : 14428883,
    "totalWriteErrors" : 0,
    "tempPath" : null,
    "partitionPath" : "year=2020/month=5/day=0/hour=0",
    "totalLogRecords" : 0,
    "totalLogFilesCompacted" : 0,
    "totalLogSizeCompacted" : 0,
    "totalUpdatedRecordsCompacted" : 0,
    "totalLogBlocks" : 0,
    "totalCorruptLogBlock" : 0,
    "totalRollbackBlocks" : 0,
    "fileSizeInBytes" : 14428883
  }, {
    "fileId" : "5990beb4-bd0c-40c9-84f1-a4107287971e-0",
    "path" : "year=2020/month=5/day=0/hour=0/5990beb4-bd0c-40c9-84f1-a4107287971e-0_2-248-163131_20200505023830.parquet",
    "prevCommit" : "null",
    "numWrites" : 198,
    "numDeletes" : 0,
    "numUpdateWrites" : 0,
    "numInserts" : 198,
    "totalWriteBytes" : 14428338,
    "totalWriteErrors" : 0,
    "tempPath" : null,
    "partitionPath" : "year=2020/month=5/day=0/hour=0",
    "totalLogRecords" : 0,
    "totalLogFilesCompacted" : 0,
    "totalLogSizeCompacted" : 0,
    "totalUpdatedRecordsCompacted" : 0,
    "totalLogBlocks" : 0,
    "totalCorruptLogBlock" : 0,
    "totalRollbackBlocks" : 0,
    "fileSizeInBytes" : 14428338
  }, {
    "fileId" : "673c5550-39c3-4611-ac68-bc0c7da065e2-0",
    "path" : "year=2020/month=5/day=0/hour=0/673c5550-39c3-4611-ac68-bc0c7da065e2-0_3-248-163132_20200505023830.parquet",
    "prevCommit" : "null",
    "numWrites" : 179,
    "numDeletes" : 0,
    "numUpdateWrites" : 0,
    "numInserts" : 179,
    "totalWriteBytes" : 14425571,
    "totalWriteErrors" : 0,
    "tempPath" : null,
    "partitionPath" : "year=2020/month=5/day=0/hour=0",
    "totalLogRecords" : 0,
    "totalLogFilesCompacted" : 0,
    "totalLogSizeCompacted" : 0,
    "totalUpdatedRecordsCompacted" : 0,
    "totalLogBlocks" : 0,
    "totalCorruptLogBlock" : 0,
    "totalRollbackBlocks" : 0,
    "fileSizeInBytes" : 14425571
  }
{code}
> [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build
master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result >
[jira] [Assigned] (HUDI-528) Incremental Pull fails when latest commit is empty
[ https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reassigned HUDI-528: --- Assignee: Yanjia Gary Li > Incremental Pull fails when latest commit is empty > -- > > Key: HUDI-528 > URL: https://issues.apache.org/jira/browse/HUDI-528 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Incremental Pull >Reporter: Javier Vega >Assignee: Yanjia Gary Li >Priority: Minor > Labels: bug-bash-0.6.0, help-requested > > When trying to create an incremental view of a dataset, an exception is > thrown when the latest commit in the time range is empty. In order to > determine the schema of the dataset, Hudi will grab the [latest commit file, > parse it, and grab the first metadata file > path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80]. > If the latest commit was empty though, the field which is used to determine > file paths (partitionToWriteStats) will be empty causing the following > exception: > > > {code:java} > java.util.NoSuchElementException > at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447) > at java.util.HashMap$ValueIterator.next(HashMap.java:1474) > at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
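The failure mode described in HUDI-528 is the unconditional `values().iterator().next()` on the latest commit's `partitionToWriteStats`. A minimal, self-contained sketch of the guard (hypothetical names and data shapes, not Hudi's actual code): skip empty commits, newest first, until one with write stats is found.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class LatestNonEmptyCommit {
    // Hypothetical stand-in for parsed commit metadata: partition -> write-stat file paths.
    // IncrementalRelation reads only the newest commit and calls values().iterator().next(),
    // which throws NoSuchElementException when partitionToWriteStats is empty.
    static Optional<String> firstDataFilePath(List<Map<String, List<String>>> commitsNewestFirst) {
        for (Map<String, List<String>> partitionToWriteStats : commitsNewestFirst) {
            for (List<String> stats : partitionToWriteStats.values()) {
                if (!stats.isEmpty()) {
                    // First file of the newest non-empty commit (partition order is
                    // arbitrary here; real code would pick deterministically).
                    return Optional.of(stats.get(0));
                }
            }
        }
        return Optional.empty(); // no data commit at all; caller must handle
    }

    public static void main(String[] args) {
        Map<String, List<String>> emptyCommit = new HashMap<>();
        Map<String, List<String>> fullCommit = new HashMap<>();
        fullCommit.put("year=2020/month=5", Arrays.asList("a.parquet", "b.parquet"));
        // Newest commit is empty; the sketch falls through to the older, non-empty one.
        System.out.println(firstDataFilePath(Arrays.asList(emptyCommit, fullCommit)).orElse("none"));
    }
}
```

The `Optional` return forces the caller to decide what an all-empty timeline means, instead of surfacing a `NoSuchElementException` from deep inside relation construction.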
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Status: In Progress (was: Open) > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Labels: bug-bash-0.6.0, pull-request-available > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty
[ https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-528: Status: In Progress (was: Open) > Incremental Pull fails when latest commit is empty > -- > > Key: HUDI-528 > URL: https://issues.apache.org/jira/browse/HUDI-528 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Incremental Pull >Reporter: Javier Vega >Assignee: Yanjia Gary Li >Priority: Minor > Labels: bug-bash-0.6.0, help-requested, pull-request-available > > When trying to create an incremental view of a dataset, an exception is > thrown when the latest commit in the time range is empty. In order to > determine the schema of the dataset, Hudi will grab the [latest commit file, > parse it, and grab the first metadata file > path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80]. > If the latest commit was empty though, the field which is used to determine > file paths (partitionToWriteStats) will be empty causing the following > exception: > > > {code:java} > java.util.NoSuchElementException > at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447) > at java.util.HashMap$ValueIterator.next(HashMap.java:1474) > at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-528) Incremental Pull fails when latest commit is empty
[ https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li resolved HUDI-528. - Resolution: Fixed > Incremental Pull fails when latest commit is empty > -- > > Key: HUDI-528 > URL: https://issues.apache.org/jira/browse/HUDI-528 > Project: Apache Hudi (incubating) > Issue Type: Bug > Components: Incremental Pull >Reporter: Javier Vega >Assignee: Yanjia Gary Li >Priority: Minor > Labels: bug-bash-0.6.0, help-requested, pull-request-available > Fix For: 0.5.3 > > > When trying to create an incremental view of a dataset, an exception is > thrown when the latest commit in the time range is empty. In order to > determine the schema of the dataset, Hudi will grab the [latest commit file, > parse it, and grab the first metadata file > path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80]. > If the latest commit was empty though, the field which is used to determine > file paths (partitionToWriteStats) will be empty causing the following > exception: > > > {code:java} > java.util.NoSuchElementException > at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447) > at java.util.HashMap$ValueIterator.next(HashMap.java:1474) > at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100967#comment-17100967 ] Yanjia Gary Li commented on HUDI-494: - Hi folks, this issue seems to be coming back again... !example2_hdfs.png! !example2_sparkui.png! A very small (2 GB) upsert job creates 60,000+ files in a single partition and gets stuck for 10+ hours. I believe there might be a bug in the BloomIndexing stage. > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Status: Open (was: New) > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li reassigned HUDI-494: --- Assignee: Yanjia Gary Li (was: Vinoth Chandar) > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101055#comment-17101055 ] Yanjia Gary Li commented on HUDI-494: - Ok, I see what happened here. Root cause is [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214] So basically commit 1 wrote a very small file (let's say 200 records) to a new partition day=05. And then when commit 2 tries to write to day=05, it will look up the affected partition and use the Bloom index range from the existing files, so it will use 200 here. Commit 2 has many more records than 200, so it will create tons of files since the Bloom index range is too small. I am not really familiar with the indexing part of the code. Please let me know if I understand this correctly and we can figure out a fix. [~lamber-ken] [~vinoth] > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS.
In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Attachment: example2_hdfs.png > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-494: Attachment: example2_sparkui.png > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Vinoth Chandar >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS
[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101055#comment-17101055 ] Yanjia Gary Li edited comment on HUDI-494 at 5/8/20, 1:38 AM: -- -Ok, I see what happened here. Root cause is [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]- So basically commit 1 wrote a very small file (let's say 200 records) to a new partition day=05. Then, when commit 2 tries to write, it looks back at commit 1 to estimate the size of each record, but because commit 1 has so few records, the estimate is inaccurate and far too large. Hudi then calculates records-per-file from that inflated record size and arrives at a very small records-per-file count. This leads to many small files. was (Author: garyli1019): Ok, I see what happened here. Root cause is [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214] So basically commit 1 wrote a very small file(let's say 200 records) to a new partition day=05. And then when commit 2 trying to write to day=05, it will look up the affected partition and use the Bloom index range from the existing files, so it will use 200 here. Commit 2 has much more records than 200, so it will create tons of files since the Bloom index range is too small. I am not really familiar with the indexing part of the code. Please let me know if I understand this correctly and we can figure out a fix. 
[~lamber-ken] [~vinoth] > [DEBUGGING] Huge amount of tasks when writing files into HDFS > - > > Key: HUDI-494 > URL: https://issues.apache.org/jira/browse/HUDI-494 > Project: Apache Hudi (incubating) > Issue Type: Test >Reporter: Yanjia Gary Li >Assignee: Yanjia Gary Li >Priority: Major > Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot > 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, > image-2020-01-05-07-30-53-567.png > > > I am using the manual build master after > [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] > commit. EDIT: tried with the latest master but got the same result > I am seeing 3 million tasks when the Hudi Spark job writing the files into > HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 > million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. > I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ > folder in my HDFS. In the Spark UI, each task only writes less than 10 > records in > {code:java} > count at HoodieSparkSqlWriter{code} > All the stages before this seem normal. Any idea what happened here? My > first guess would be something related to the bloom filter index. Maybe > somewhere trigger the repartitioning with the bloom filter index? But I am > not really familiar with that part of the code. > Thanks > -- This message was sent by Atlassian Jira (v8.3.4#803005)
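That record-size explanation can be checked against the commit 1 stats posted earlier in this thread (21 records, 14,397,559 bytes). A minimal sketch of the arithmetic, assuming Hudi's default 120 MB parquet target file size (hoodie.parquet.max.file.size); the method names are illustrative, not Hudi's actual API:

```java
public class RecordSizeEstimate {
    // Average bytes per record, estimated from the previous commit's write stats.
    static long avgRecordSize(long totalWriteBytes, long numWrites) {
        return totalWriteBytes / numWrites;
    }

    // How many records the writer will pack into one file at that estimate.
    static long recordsPerFile(long maxFileSizeBytes, long avgRecordSizeBytes) {
        return maxFileSizeBytes / avgRecordSizeBytes;
    }

    public static void main(String[] args) {
        // Commit 1 stats from the thread: 21 records, 14,397,559 bytes written.
        long avg = avgRecordSize(14_397_559L, 21L);       // ~685 KB "per record": wildly inflated
        long perFile = recordsPerFile(120L * 1024 * 1024, avg);
        // perFile comes out to 183, matching the ~179-200 records per file seen in
        // commit 2, so the 12,817 incoming records fan out into many near-empty files.
        System.out.println(avg + " bytes/record -> " + perFile + " records/file");
    }
}
```

The inflated per-record estimate comes from parquet's fixed overhead dominating a 21-record file; dividing the target file size by it caps every new file at a handful of records.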
[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource
[ https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-905: Priority: Minor (was: Major) > Support PrunedFilteredScan for Spark Datasource > --- > > Key: HUDI-905 > URL: https://issues.apache.org/jira/browse/HUDI-905 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Priority: Minor > > Hudi Spark Datasource incremental view currently uses > DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter. > If we want to use Spark predicate pushdown in a native way, we need to > implement PrunedFilteredScan for Hudi Datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)
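For context, PrunedFilteredScan is the Spark Data Source V1 contract whose buildScan receives the required columns and the pushable filters from the optimizer. A self-contained sketch of that prune-then-project idea over plain maps (hypothetical names; not Hudi or Spark code, and only equality filters for brevity):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrunedFilteredScanSketch {
    // Mimics buildScan(requiredColumns, filters): keep only rows matching the
    // pushed-down equality filters, then project each row to the required columns,
    // so unneeded columns and rows never leave the data source.
    static List<Map<String, Object>> buildScan(List<Map<String, Object>> rows,
                                               List<String> requiredColumns,
                                               Map<String, Object> equalityFilters) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            boolean matches = equalityFilters.entrySet().stream()
                    .allMatch(f -> f.getValue().equals(row.get(f.getKey())));
            if (!matches) continue;                        // filter pushdown
            Map<String, Object> projected = new HashMap<>();
            for (String col : requiredColumns) projected.put(col, row.get(col));
            out.add(projected);                            // column pruning
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = new HashMap<>();
        r1.put("year", 2020); r1.put("rider", "a"); r1.put("fare", 10.0);
        Map<String, Object> r2 = new HashMap<>();
        r2.put("year", 2019); r2.put("rider", "b"); r2.put("fare", 20.0);
        Map<String, Object> filter = new HashMap<>();
        filter.put("year", 2020);
        // Keeps only the 2020 row and projects it down to the rider column.
        System.out.println(buildScan(Arrays.asList(r1, r2), Arrays.asList("rider"), filter));
    }
}
```

In the real V1 API the filters arrive as org.apache.spark.sql.sources.Filter trees rather than a flat equality map, and the relation returns an RDD of rows, but the pruning/filtering contract is the same.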
[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource
[ https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-905: Status: Open (was: New) > Support PrunedFilteredScan for Spark Datasource > --- > > Key: HUDI-905 > URL: https://issues.apache.org/jira/browse/HUDI-905 > Project: Apache Hudi (incubating) > Issue Type: New Feature >Reporter: Yanjia Gary Li >Priority: Minor > > Hudi Spark Datasource incremental view currently uses > DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter. > If we want to use Spark predicate pushdown in a native way, we need to > implement PrunedFilteredScan for Hudi Datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource
[ https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanjia Gary Li updated HUDI-905: Component/s: Spark Integration > Support PrunedFilteredScan for Spark Datasource > --- > > Key: HUDI-905 > URL: https://issues.apache.org/jira/browse/HUDI-905 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Spark Integration >Reporter: Yanjia Gary Li >Priority: Minor > > Hudi Spark Datasource incremental view currently uses > DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter. > If we want to use Spark predicate pushdown in a native way, we need to > implement PrunedFilteredScan for Hudi Datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)