[jira] [Created] (HUDI-7773) Allow Users to extend S3/GCS HoodieIncrSource to bring in additional columns from upstream

2024-05-17 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-7773:


 Summary: Allow Users to extend S3/GCS HoodieIncrSource to bring in 
additional columns from upstream
 Key: HUDI-7773
 URL: https://issues.apache.org/jira/browse/HUDI-7773
 Project: Apache Hudi
  Issue Type: Improvement
  Components: deltastreamer
Reporter: Balaji Varadarajan
Assignee: Balaji Varadarajan


Current S3/GCS HoodieIncrSource reads file paths from upstream tables and 
ingests them into downstream tables. We need the ability to extend this 
functionality by joining in additional columns from the upstream table before 
writing to the downstream table.
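
A hedged sketch of the extension being asked for, assuming a hypothetical hook 
that receives the resolved file-path batch and the upstream table as DataFrames 
(class, method, join-key and column names below are illustrative, not an 
existing Hudi API):

{code:scala}
import org.apache.spark.sql.DataFrame

// Hypothetical extension hook: join extra upstream columns onto the batch of
// resolved file paths before it is written to the downstream table.
def enrichWithUpstreamColumns(filePathBatch: DataFrame, upstream: DataFrame): DataFrame = {
  filePathBatch.join(
    upstream.select("s3_object_key", "tenant_id"), // illustrative join key and extra column
    Seq("s3_object_key"),
    "left")
}
{code}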



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7674) Hudi CLI : Command "metadata validate-files" not using file listing to validate

2024-04-25 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-7674:


 Summary: Hudi CLI : Command "metadata validate-files" not using 
file listing to validate
 Key: HUDI-7674
 URL: https://issues.apache.org/jira/browse/HUDI-7674
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Balaji Varadarajan
Assignee: Balaji Varadarajan


metadata validate-files is expected to compare the file system view provided by 
the metadata layer against a raw file listing, but this is broken. 
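
For reference, a minimal spark-shell sketch of the comparison the command is 
expected to perform (the base path, metadata option usage, and parquet-only 
filter below are assumptions for illustration, not the CLI's actual 
implementation):

{code:scala}
import org.apache.hadoop.fs.Path
import scala.collection.mutable

val basePath = "/tmp/hudi_table" // hypothetical table location

// File names as resolved through the metadata table.
val viaMetadata = spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .load(basePath)
  .select("_hoodie_file_name").distinct()
  .collect().map(_.getString(0)).toSet

// File names from a raw recursive listing of the table path.
val fs = new Path(basePath).getFileSystem(spark.sparkContext.hadoopConfiguration)
val listed = mutable.Set[String]()
val it = fs.listFiles(new Path(basePath), true)
while (it.hasNext) {
  val f = it.next()
  if (f.getPath.getName.endsWith(".parquet")) listed += f.getPath.getName
}

println(s"only in metadata: ${viaMetadata.diff(listed.toSet)}")
println(s"only on storage:  ${listed.toSet.diff(viaMetadata)}")
{code}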



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7008) Fixing usage of Kafka Avro deserializer w/ debezium sources

2023-10-30 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-7008:


Assignee: sivabalan narayanan

> Fixing usage of Kafka Avro deserializer w/ debezium sources
> ---
>
> Key: HUDI-7008
> URL: https://issues.apache.org/jira/browse/HUDI-7008
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7008) Fixing usage of Kafka Avro deserializer w/ debezium sources

2023-10-30 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-7008:


 Summary: Fixing usage of Kafka Avro deserializer w/ debezium 
sources
 Key: HUDI-7008
 URL: https://issues.apache.org/jira/browse/HUDI-7008
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Balaji Varadarajan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5933) Fix NullPointer Exception in MultiTableDeltaStreamer when Transformer_class config is not set

2023-03-14 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-5933:


 Summary: Fix NullPointer Exception in MultiTableDeltaStreamer when 
Transformer_class config is not set
 Key: HUDI-5933
 URL: https://issues.apache.org/jira/browse/HUDI-5933
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Balaji Varadarajan


Context : https://github.com/apache/hudi/pull/6726#issuecomment-1468270289



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-2761) IllegalArgException from timeline server when serving getLastestBaseFiles with multi-writer

2021-11-23 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448275#comment-17448275
 ] 

Balaji Varadarajan commented on HUDI-2761:
--

[~shivnarayan]: Not sure I understood why you think (a) is infeasible. In this 
case, when it fails the first time, the driver would already have updated to 
the latest commit, so it should not error out unless another commit arrives. 
(a) would keep/reduce the FS calls to only the driver, whereas (b) could 
increase FS calls. 

I think we should handle this in RemoteHoodieTableFileSystemView and retry 
(once) before it gives up and the executor loads the filesystem view locally. 

Regarding the exception stack trace, I agree we can make it an INFO message 
without dumping the stack trace. 
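
A minimal, self-contained sketch of the retry-then-fallback behavior described 
above (trait and method names are illustrative, not the actual 
RemoteHoodieTableFileSystemView API):

{code:scala}
// View abstraction standing in for the remote/local file system views.
trait BaseFileView { def latestBaseFiles(partition: String): Seq[String] }

// Stand-in for the "last known instant" validation error from the timeline server.
class StaleViewException(msg: String) extends RuntimeException(msg)

def latestBaseFilesWithFallback(remote: BaseFileView,
                                buildLocalView: () => BaseFileView,
                                partition: String): Seq[String] = {
  try remote.latestBaseFiles(partition)              // first attempt against the timeline server
  catch {
    case _: StaleViewException =>
      try remote.latestBaseFiles(partition)          // retry once after the server syncs to the latest timeline
      catch {
        case _: Exception =>
          buildLocalView().latestBaseFiles(partition) // last resort: build the view locally on the executor
      }
  }
}
{code}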

> IllegalArgException from timeline server when serving getLastestBaseFiles 
> with multi-writer
> ---
>
> Key: HUDI-2761
> URL: https://issues.apache.org/jira/browse/HUDI-2761
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.10.0
>
> Attachments: Screen Shot 2021-11-15 at 8.27.11 AM.png, Screen Shot 
> 2021-11-15 at 8.27.33 AM.png, Screen Shot 2021-11-15 at 8.28.03 AM.png, 
> Screen Shot 2021-11-15 at 8.28.25 AM.png
>
>
> When concurrent writers try to ingest to hudi, occasionally we run into an 
> IllegalArgumentException as below. Even though the exception is seen, the 
> actual write still succeeds. 
> Here is what is happening, from my understanding. 
>  
> Let's say the table's latest commit is C3. 
> Writer1 tries to commit C4, writer2 tries to do C5 and writer3 tries to do C6 
> (all 3 are non-overlapping and so expected to succeed). 
> I started C4 from writer1, then switched to writer2 and triggered C5, and 
> then did the same for writer3. 
> C4 went through fine for writer1 and succeeded. 
> For writer2, when the timeline got instantiated, its latest snapshot was C3, 
> but when it received the getLatestBaseFiles() request, the latest commit was 
> C4, so it throws an exception. A similar issue happened w/ writer3 as well. 
>  
> {code:java}
> scala> df.write.format("hudi").
>      |   options(getQuickstartWriteConfigs).
>      |   option(PRECOMBINE_FIELD.key(), "created_at").
>      |   option(RECORDKEY_FIELD.key(), "other").
>      |   option(PARTITIONPATH_FIELD.key(), "type").
>      |   option("hoodie.cleaner.policy.failed.writes","LAZY").
>      |   
> option("hoodie.write.concurrency.mode","OPTIMISTIC_CONCURRENCY_CONTROL").
>      |   
> option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
>      |   option("hoodie.write.lock.zookeeper.url","localhost").
>      |   option("hoodie.write.lock.zookeeper.port","2181").
>      |   option("hoodie.write.lock.zookeeper.lock_key","locks").
>      |   
> option("hoodie.write.lock.zookeeper.base_path","/tmp/mw_testing/.locks").
>      |   option(TBL_NAME.key(), tableName).
>      |   mode(Append).
>      |   save(basePath)
> 21/11/15 07:47:33 WARN HoodieSparkSqlWriter$: Commit time 2025074733457
> 21/11/15 07:47:35 WARN EmbeddedTimelineService: Started embedded timeline 
> server at 10.0.0.202:57644
> 21/11/15 07:47:39 ERROR RequestHandler: Got runtime exception servicing request 
> partition=CreateEvent=2025074301094=file%3A%2Ftmp%2Fmw_testing%2Ftrial2=2025074301094=ce963fe977a9d2176fadecf16c223cb3b98d7f6f7aaaf41cd7855eb098aee47d
> java.lang.IllegalArgumentException: Last known instant from client was 
> 2025074301094 but server has the following timeline 
> [[2025074301094__commit__COMPLETED], 
> [2025074731908__commit__COMPLETED]]
>     at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>     at 
> org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:510)
>     at io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22)
>     at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606)
>     at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46)
>     at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17)
>     at io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143)
>     at io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41)
>     at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107)
>     at 
> io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72)
>     at 
> 

[jira] [Created] (HUDI-2166) Support Alter table drop column

2021-07-12 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-2166:


 Summary: Support Alter table drop column 
 Key: HUDI-2166
 URL: https://issues.apache.org/jira/browse/HUDI-2166
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: Balaji Varadarajan
Assignee: pengzhiwei


Just like adding and renaming columns, we need DDL support for dropping columns.
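
Illustrative only: the kind of Spark SQL statement this ticket asks to support 
(table and column names are hypothetical):

{code:scala}
// Drop a column from a Hudi table registered in the catalog.
spark.sql("ALTER TABLE hudi_trips DROP COLUMN rider")
{code}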



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1741) Row Level TTL Support for records stored in Hudi

2021-03-30 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1741:


 Summary: Row Level TTL Support for records stored in Hudi
 Key: HUDI-1741
 URL: https://issues.apache.org/jira/browse/HUDI-1741
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Utilities
Reporter: Balaji Varadarajan


E.g.: retain only records that were updated within the last month. 

 

GH: https://github.com/apache/hudi/issues/2743
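
A hedged sketch of what row-level TTL could amount to for a user today, by 
issuing deletes for records whose update timestamp falls outside the retention 
window (the column name, 30-day window, table name, and basePath are 
assumptions for illustration):

{code:scala}
import org.apache.spark.sql.functions._

val basePath = "/tmp/hudi_trips" // hypothetical table location

// Records not updated within the last 30 days are considered expired.
val expired = spark.read.format("hudi").load(basePath)
  .filter(col("update_time") < date_sub(current_date(), 30))

// Issue deletes for the expired records.
expired.write.format("hudi")
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.table.name", "hudi_trips")
  .mode("append")
  .save(basePath)
{code}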



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1741) Row Level TTL Support for records stored in Hudi

2021-03-30 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311938#comment-17311938
 ] 

Balaji Varadarajan commented on HUDI-1741:
--

[~shivnarayan] : FYI

> Row Level TTL Support for records stored in Hudi
> 
>
> Key: HUDI-1741
> URL: https://issues.apache.org/jira/browse/HUDI-1741
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Utilities
>Reporter: Balaji Varadarajan
>Priority: Major
>
> E.g.: retain only records that were updated within the last month. 
>  
> GH: https://github.com/apache/hudi/issues/2743



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1724) run_sync_tool support for hive3.1.2 on hadoop3.1.4

2021-03-26 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17309272#comment-17309272
 ] 

Balaji Varadarajan commented on HUDI-1724:
--

[~shivnarayan]: Can you please triage this?

> run_sync_tool support for hive3.1.2 on hadoop3.1.4
> --
>
> Key: HUDI-1724
> URL: https://issues.apache.org/jira/browse/HUDI-1724
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Context: https://github.com/apache/hudi/issues/2717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1724) run_sync_tool support for hive3.1.2 on hadoop3.1.4

2021-03-26 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1724:


 Summary: run_sync_tool support for hive3.1.2 on hadoop3.1.4
 Key: HUDI-1724
 URL: https://issues.apache.org/jira/browse/HUDI-1724
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: Balaji Varadarajan


Context: https://github.com/apache/hudi/issues/2717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1711) Avro Schema Exception with Spark 3.0 in 0.7

2021-03-23 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307095#comment-17307095
 ] 

Balaji Varadarajan commented on HUDI-1711:
--

[~shivnarayan]: Can you triage this issue when you get a chance.

> Avro Schema Exception with Spark 3.0 in 0.7
> ---
>
> Key: HUDI-1711
> URL: https://issues.apache.org/jira/browse/HUDI-1711
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Priority: Major
>
> GH: [https://github.com/apache/hudi/issues/2705]
>  
>  
> {{21/03/22 10:10:35 WARN util.package: Truncated the string representation of 
> a plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 21/03/22 10:10:35 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while decoding: 
> java.lang.NegativeArraySizeException: -1255727808
> createexternalrow([... decoded-row expression over fields id, name, type, url, 
> user, password, create_time, create_user, update_time, update_user, del_flag 
> and the source struct, truncated ...])}}

[jira] [Created] (HUDI-1711) Avro Schema Exception with Spark 3.0 in 0.7

2021-03-23 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1711:


 Summary: Avro Schema Exception with Spark 3.0 in 0.7
 Key: HUDI-1711
 URL: https://issues.apache.org/jira/browse/HUDI-1711
 Project: Apache Hudi
  Issue Type: Bug
  Components: DeltaStreamer
Reporter: Balaji Varadarajan


GH: [https://github.com/apache/hudi/issues/2705]

 

 

{{21/03/22 10:10:35 WARN util.package: Truncated the string representation of a 
plan since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
21/03/22 10:10:35 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while decoding: 
java.lang.NegativeArraySizeException: -1255727808
createexternalrow([... decoded-row expression over fields id, name, type, url, 
user, password, create_time, create_user, update_time, update_user, del_flag 
and the source struct (version, connector, name, ts_ms, snapshot, db, table, 
server_id, gtid, file, pos, row, ...), truncated ...])}}

[jira] [Commented] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file

2021-02-25 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290922#comment-17290922
 ] 

Balaji Varadarajan commented on HUDI-1640:
--

[~shivnarayan]: Can you vet this and add it to the work queue?

> Implement Spark Datasource option to read hudi configs from properties file
> ---
>
> Key: HUDI-1640
> URL: https://issues.apache.org/jira/browse/HUDI-1640
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Provide a config option like "hoodie.datasource.props.file" to load all the 
> options from a file.
>  
> GH: https://github.com/apache/hudi/issues/2605



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file

2021-02-25 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1640:


 Summary: Implement Spark Datasource option to read hudi configs 
from properties file
 Key: HUDI-1640
 URL: https://issues.apache.org/jira/browse/HUDI-1640
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: Balaji Varadarajan


Provide a config option like "hoodie.datasource.props.file" to load all the 
options from a file.

 

GH: https://github.com/apache/hudi/issues/2605
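
Illustrative usage of the proposed option ("hoodie.datasource.props.file" is 
the name suggested in this ticket, not an existing config; df and basePath are 
assumed to be in scope):

{code:scala}
// All other Hudi write options would be read from the referenced properties file.
df.write.format("hudi")
  .option("hoodie.datasource.props.file", "/etc/hudi/write.properties")
  .mode("append")
  .save(basePath)
{code}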



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1608) MOR fetches all records for read optimized query w/ spark sql

2021-02-10 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282851#comment-17282851
 ] 

Balaji Varadarajan commented on HUDI-1608:
--

[~shivnarayan]: You need to set spark.sql.hive.convertMetastoreParquet=false 
(https://hudi.apache.org/docs/querying_data.html#spark-sql)
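
For example, in spark-shell before running the query (a usage sketch of the 
setting mentioned above):

{code:scala}
// Disable Spark's native parquet conversion so Hudi's Hive input format is used.
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
{code}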

> MOR fetches all records for read optimized query w/ spark sql
> -
>
> Key: HUDI-1608
> URL: https://issues.apache.org/jira/browse/HUDI-1608
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> Script to reproduce in local spark:
>  
> [https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364]
>  
> ```
> scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, id, __op from hudi_trips_snapshot order by 
> _hoodie_record_key").show(false)
> +-------------------+------------------+----------------------+---+----+
> |_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
> +-------------------+------------------+----------------------+---+----+
> |20210210070347     |1                 |1970-01-01            |1  |null|
> |20210210070347     |2                 |1970-01-01            |2  |null|
> |20210210070347     |3                 |2020-01-04            |3  |D   |
> |20210210070347     |4                 |1998-04-13            |4  |I   |
> |20210210070347     |5                 |2020-01-01            |5  |I   |
> |*20210210070445*   |*6*               |*1998-04-13*          |*6*|*I* |
> +-------------------+------------------+----------------------+---+----+
> ```
> After an upsert, the read-optimized query returns records from both C1 and 
> C2. Also, I don't find any log files in the partitions; all of them are 
> parquet files. 
>  
> ls /tmp/hudi_trips_cow/1998-04-13/
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet
> ls /tmp/hudi_trips_cow/1970-01-01/
> 7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
> 7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet
>  
> Source of the issue: [https://github.com/apache/hudi/issues/2255]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1523) Avoid excessive mkdir calls when creating new files

2021-01-11 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1523:


 Summary: Avoid excessive mkdir calls when creating new files
 Key: HUDI-1523
 URL: https://issues.apache.org/jira/browse/HUDI-1523
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.8.0


https://github.com/apache/hudi/issues/2423



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1505) Allow pluggable option to write error records to side table, queue

2021-01-04 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1505:


 Summary: Allow pluggable option to write error records to side 
table, queue
 Key: HUDI-1505
 URL: https://issues.apache.org/jira/browse/HUDI-1505
 Project: Apache Hudi
  Issue Type: New Feature
  Components: DeltaStreamer
Reporter: Balaji Varadarajan
 Fix For: 0.8.0


Context : https://github.com/apache/hudi/issues/2401



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1501) Explore providing ways to auto-tune input record size based on incoming payload

2020-12-31 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1501:


 Summary: Explore providing ways to auto-tune input record size 
based on incoming payload
 Key: HUDI-1501
 URL: https://issues.apache.org/jira/browse/HUDI-1501
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.8.0


Context: https://github.com/apache/hudi/issues/2393#issuecomment-752452753



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1501) Explore providing ways to auto-tune input record size based on incoming payload

2020-12-31 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1501:
-
Status: Open  (was: New)

> Explore providing ways to auto-tune input record size based on incoming 
> payload
> ---
>
> Key: HUDI-1501
> URL: https://issues.apache.org/jira/browse/HUDI-1501
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Minor
> Fix For: 0.8.0
>
>
> Context: https://github.com/apache/hudi/issues/2393#issuecomment-752452753



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1499) Support configuration to let user override record-size estimate

2020-12-29 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1499:


Assignee: sivabalan narayanan

> Support configuration to let user override record-size estimate  
> -
>
> Key: HUDI-1499
> URL: https://issues.apache.org/jira/browse/HUDI-1499
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: newbie
> Fix For: 0.8.0
>
>
> Context: [https://github.com/apache/hudi/issues/2393]
>  
> This would be helpful if, for some reason, the user needs to ingest a batch 
> of records with very different record sizes compared to the existing records. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1499) Support configuration to let user override record-size estimate

2020-12-29 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1499:


 Summary: Support configuration to let user override record-size 
estimate  
 Key: HUDI-1499
 URL: https://issues.apache.org/jira/browse/HUDI-1499
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.8.0


Context: [https://github.com/apache/hudi/issues/2393]

 

This would be helpful if, for some reason, the user needs to ingest a batch of 
records with very different record sizes compared to the existing records. 
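
An illustrative sketch of the override being asked for; the option key below 
is an assumption for illustration and not confirmed by this ticket (df and 
basePath assumed in scope):

{code:scala}
df.write.format("hudi")
  .option("hoodie.copyonwrite.record.size.estimate", "512") // assumed key: avg record size in bytes
  .mode("append")
  .save(basePath)
{code}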



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1499) Support configuration to let user override record-size estimate

2020-12-29 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1499:
-
Status: Open  (was: New)

> Support configuration to let user override record-size estimate  
> -
>
> Key: HUDI-1499
> URL: https://issues.apache.org/jira/browse/HUDI-1499
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: newbie
> Fix For: 0.8.0
>
>
> Context: [https://github.com/apache/hudi/issues/2393]
>  
> This would be helpful if, for some reason, the user needs to ingest a batch 
> of records with very different record sizes compared to the existing records. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1497) Timeout Exception during getFileStatus()

2020-12-28 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1497:
-
Status: Open  (was: New)

> Timeout Exception during getFileStatus() 
> -
>
> Key: HUDI-1497
> URL: https://issues.apache.org/jira/browse/HUDI-1497
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Seeing this happening when running the RFC-15 branch in long-running mode. 
> There could be a resource leak, as I am seeing this consistently after every 
> 1- or 2-hour run. The log below shows it happening while accessing the 
> bootstrap index, but I am seeing it in getFileStatus() for other files too.
>  
>  
> Caused by: java.io.InterruptedIOException: getFileStatus on 
> s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile:
>  com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout 
> waiting for connection from poolCaused by: java.io.InterruptedIOException: 
> getFileStatus on 
> s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile:
>  com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout 
> waiting for connection from pool at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:141) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:117) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1859)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1823)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763) 
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1627) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2500) at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.exists(HoodieWrapperFileSystem.java:549)
>  at 
> org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.(HFileBootstrapIndex.java:102)
>  ... 33 moreCaused by: com.amazonaws.SdkClientException: Unable to execute 
> HTTP request: Timeout waiting for connection from pool at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1113)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1063)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
>  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4229) at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4176) at 
> com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1253)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1053)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1841)
>  ... 39 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1497) Timeout Exception during getFileStatus()

2020-12-28 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1497:


 Summary: Timeout Exception during getFileStatus() 
 Key: HUDI-1497
 URL: https://issues.apache.org/jira/browse/HUDI-1497
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


Seeing this happening when running the RFC-15 branch in long-running mode. 
There could be a resource leak, as I am seeing this consistently after every 
1- or 2-hour run. The log below shows it happening while accessing the 
bootstrap index, but I am seeing it in getFileStatus() for other files too.

 

 

Caused by: java.io.InterruptedIOException: getFileStatus on 
s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile:
 com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout 
waiting for connection from poolCaused by: java.io.InterruptedIOException: 
getFileStatus on 
s3://robinhood-encrypted-hudi-data-cove/dummy/balaji/sickle/public/client_ledger_clientledgerbalance/test_v4/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile:
 com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout 
waiting for connection from pool at 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:141) at 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:117) at 
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1859) 
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1823)
 at 
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763) 
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1627) at 
org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2500) at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.exists(HoodieWrapperFileSystem.java:549)
 at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.(HFileBootstrapIndex.java:102)
 ... 33 moreCaused by: com.amazonaws.SdkClientException: Unable to execute HTTP 
request: Timeout waiting for connection from pool at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1113)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1063)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
 at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4229) at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4176) at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1253)
 at 
org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1053)
 at 
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1841) 
... 39 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1496) Seek Error when querying MOR tables in GCP

2020-12-28 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1496:


Assignee: sivabalan narayanan

> Seek Error when querying MOR tables in GCP
> --
>
> Key: HUDI-1496
> URL: https://issues.apache.org/jira/browse/HUDI-1496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Context : [https://github.com/apache/hudi/issues/2367]
> FSUtils.isGCSInputStream is not catching all the cases when reading from GCS. 
> In some cases 
> (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java#L76), 
> the condition in isGCSInputStream breaks.
>  
> Instead of isGCSInputStream, we should detect GCSFileSystem by checking 
> whether the filesystem scheme matches the GCS entry in StorageSchemes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1496) Seek Error when querying MOR tables in GCP

2020-12-28 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1496:
-
Status: Open  (was: New)

> Seek Error when querying MOR tables in GCP
> --
>
> Key: HUDI-1496
> URL: https://issues.apache.org/jira/browse/HUDI-1496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Context : [https://github.com/apache/hudi/issues/2367]
> FSUtils.isGCSInputStream is not catching all the cases when reading from GCS. 
> In some cases 
> (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java#L76), 
> the condition in isGCSInputStream breaks.
>  
> Instead of isGCSInputStream, we should detect GCSFileSystem by checking 
> whether the filesystem scheme matches the GCS entry in StorageSchemes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1496) Seek Error when querying MOR tables in GCP

2020-12-28 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1496:


 Summary: Seek Error when querying MOR tables in GCP
 Key: HUDI-1496
 URL: https://issues.apache.org/jira/browse/HUDI-1496
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Reporter: Balaji Varadarajan


Context : [https://github.com/apache/hudi/issues/2367]

FSUtils.isGCSInputStream is not catching all the cases when reading from GCS. 
In some cases 
(https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java#L76), 
the condition in isGCSInputStream breaks.

Instead of isGCSInputStream, we should detect GCSFileSystem by checking whether 
the filesystem scheme matches the GCS entry in StorageSchemes.
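
A minimal sketch of the scheme-based check suggested above (the helper name is 
illustrative, not Hudi's actual FSUtils/StorageSchemes API):

{code:scala}
import org.apache.hadoop.fs.FileSystem

// GCS registers the "gs" scheme, so checking the scheme covers all GCS cases
// regardless of the wrapped input-stream type.
def isGcsFileSystem(fs: FileSystem): Boolean =
  "gs".equalsIgnoreCase(fs.getScheme)
{code}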

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1490) Incremental Query fails if there are partitions that have no incremental changes

2020-12-23 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1490:
-
Status: Open  (was: New)

> Incremental Query fails if there are partitions that have no incremental 
> changes
> 
>
> Key: HUDI-1490
> URL: https://issues.apache.org/jira/browse/HUDI-1490
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> Context: https://github.com/apache/hudi/issues/2362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1490) Incremental Query fails if there are partitions that have no incremental changes

2020-12-23 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1490:


 Summary: Incremental Query fails if there are partitions that have 
no incremental changes
 Key: HUDI-1490
 URL: https://issues.apache.org/jira/browse/HUDI-1490
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


Context: https://github.com/apache/hudi/issues/2362



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1475) Fix documentation of preCombine to clarify when this API is used by Hudi

2020-12-18 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252023#comment-17252023
 ] 

Balaji Varadarajan commented on HUDI-1475:
--

Relevant Issue: https://github.com/apache/hudi/issues/2345

> Fix documentation of preCombine to clarify when this API is used by Hudi 
> -
>
> Key: HUDI-1475
> URL: https://issues.apache.org/jira/browse/HUDI-1475
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> We need to fix the Javadoc of preCombine in HoodieRecordPayload to clarify 
> that this method is used to pre-merge unmerged (compaction) records and 
> incoming records before they are merged with the existing record in the 
> dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1475) Fix documentation of preCombine to clarify when this API is used by Hudi

2020-12-18 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1475:


 Summary: Fix documentation of preCombine to clarify when this API 
is used by Hudi 
 Key: HUDI-1475
 URL: https://issues.apache.org/jira/browse/HUDI-1475
 Project: Apache Hudi
  Issue Type: Task
  Components: Docs
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


We need to fix the Javadoc of preCombine in HoodieRecordPayload to clarify that 
this method is used to pre-merge unmerged (compaction) records and incoming 
records before they are merged with the existing record in the dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1475) Fix documentation of preCombine to clarify when this API is used by Hudi

2020-12-18 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1475:
-
Status: Open  (was: New)

> Fix documentation of preCombine to clarify when this API is used by Hudi 
> -
>
> Key: HUDI-1475
> URL: https://issues.apache.org/jira/browse/HUDI-1475
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> We need to fix the Javadoc of preCombine in HoodieRecordPayload to clarify 
> that this method is used to pre-merge unmerged (compaction) records and 
> incoming records before they are merged with the existing record in the 
> dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off

2020-12-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1452:
-
Description: 
[https://github.com/apache/hudi/issues/2321]

 

We need to make RocksDBFileSystemView lazily initializable so that it would 
work seamlessly when run in an executor.

  was:
[https://github.com/apache/hudi/issues/2321]

 

We need to make 


> RocksDB FileSystemView throwing NotSerializableError when embedded timeline 
> server is turned off
> 
>
> Key: HUDI-1452
> URL: https://issues.apache.org/jira/browse/HUDI-1452
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>
> [https://github.com/apache/hudi/issues/2321]
>  
> We need to make RocksDBFileSystemView lazily initializable so that it would 
> work seamlessly when run in an executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off

2020-12-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1452:
-
Description: 
[https://github.com/apache/hudi/issues/2321]

 

We need to make 

  was:https://github.com/apache/hudi/issues/2321


> RocksDB FileSystemView throwing NotSerializableError when embedded timeline 
> server is turned off
> 
>
> Key: HUDI-1452
> URL: https://issues.apache.org/jira/browse/HUDI-1452
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>
> [https://github.com/apache/hudi/issues/2321]
>  
> We need to make 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off

2020-12-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1452:
-
Status: Open  (was: New)

> RocksDB FileSystemView throwing NotSerializableError when embedded timeline 
> server is turned off
> 
>
> Key: HUDI-1452
> URL: https://issues.apache.org/jira/browse/HUDI-1452
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>
> https://github.com/apache/hudi/issues/2321



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off

2020-12-10 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1452:


 Summary: RocksDB FileSystemView throwing NotSerializableError when 
embedded timeline server is turned off
 Key: HUDI-1452
 URL: https://issues.apache.org/jira/browse/HUDI-1452
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Reporter: Balaji Varadarajan


https://github.com/apache/hudi/issues/2321



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1452) RocksDB FileSystemView throwing NotSerializableError when embedded timeline server is turned off

2020-12-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1452:


Assignee: Sreeram Ramji

> RocksDB FileSystemView throwing NotSerializableError when embedded timeline 
> server is turned off
> 
>
> Key: HUDI-1452
> URL: https://issues.apache.org/jira/browse/HUDI-1452
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>
> https://github.com/apache/hudi/issues/2321



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1440) Allow option to override schema when doing spark.write

2020-12-08 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1440:
-
Status: Open  (was: New)

> Allow option to override schema when doing spark.write
> --
>
> Key: HUDI-1440
> URL: https://issues.apache.org/jira/browse/HUDI-1440
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.8.0
>
>
> Need the ability to pass a schema and use it to create the RDD when creating 
> the input batch from the data frame. 
>  
> df.write.format("hudi").option("hudi.avro.schema", "")..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1440) Allow option to override schema when doing spark.write

2020-12-08 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1440:


 Summary: Allow option to override schema when doing spark.write
 Key: HUDI-1440
 URL: https://issues.apache.org/jira/browse/HUDI-1440
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Spark Integration
Reporter: Balaji Varadarajan
 Fix For: 0.8.0


Need the ability to pass a schema and use it to create the RDD when creating 
the input batch from the data frame. 

 

df.write.format("hudi").option("hudi.avro.schema", "")..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1436) Provide Option to run auto clean every nth commit.

2020-12-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1436:


 Summary: Provide Option to run auto clean every nth commit. 
 Key: HUDI-1436
 URL: https://issues.apache.org/jira/browse/HUDI-1436
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Cleaner
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


Need a mechanism (similar to compaction scheduling via 
hoodie.compact.inline.max.delta.commits) that lets cleaning be scheduled every 
nth commit. 
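
Illustrative only: a hypothetical knob in the spirit of 
hoodie.compact.inline.max.delta.commits (the option name below does not exist; 
it just shows the intended behavior, with df and basePath assumed in scope):

{code:scala}
df.write.format("hudi")
  .option("hoodie.clean.every.n.commits", "10") // hypothetical: schedule a clean only every 10th commit
  .mode("append")
  .save(basePath)
{code}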



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1436) Provide Option to run auto clean every nth commit.

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1436:
-
Status: Open  (was: New)

> Provide Option to run auto clean every nth commit. 
> ---
>
> Key: HUDI-1436
> URL: https://issues.apache.org/jira/browse/HUDI-1436
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
> Fix For: 0.7.0
>
>
> Need a mechanism (similar to compaction scheduling via 
> hoodie.compact.inline.max.delta.commits) that lets cleaning be scheduled 
> every nth commit. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1436) Provide Option to run auto clean every nth commit.

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1436:


Assignee: Sreeram Ramji

> Provide Option to run auto clean every nth commit. 
> ---
>
> Key: HUDI-1436
> URL: https://issues.apache.org/jira/browse/HUDI-1436
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
> Fix For: 0.7.0
>
>
> Need a mechanism (similar to compaction scheduling via 
> hoodie.compact.inline.max.delta.commits) that lets cleaning be scheduled 
> every nth commit. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1435:
-
Status: Patch Available  (was: In Progress)

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1435:
-
Status: Open  (was: New)

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1435:
-
Status: In Progress  (was: Open)

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1435:
-
Status: In Progress  (was: Open)

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1435:
-
Summary: Marker File Reconciliation failing for Non-Partitioned datasets 
when duplicate marker files present  (was: Marker File Reconciliation failing 
for Non-Partitioned Paths when duplicate marker files present)

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned Paths when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1435:


Assignee: Balaji Varadarajan

> Marker File Reconciliation failing for Non-Partitioned Paths when duplicate 
> marker files present
> 
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned Paths when duplicate marker files present

2020-12-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1435:


 Summary: Marker File Reconciliation failing for Non-Partitioned 
Paths when duplicate marker files present
 Key: HUDI-1435
 URL: https://issues.apache.org/jira/browse/HUDI-1435
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Common Core
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1329) Support async compaction in spark DF write()

2020-12-03 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243675#comment-17243675
 ] 

Balaji Varadarajan commented on HUDI-1329:
--

[~309637554]: This API only runs compaction. Note that there is no 
input dataframe to be ingested. You can create a dummy dataframe if needed, but 
the operation does not have to care about the input DF. It only needs to run 
compaction: the specific compaction instant if provided by the user, or the 
oldest pending one if not provided.
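
For illustration, here is a minimal Scala sketch of how such an invocation might 
look from spark-shell, assuming the proposed "run_compact" operation value from the 
description below; the commented-out instant option is purely hypothetical and not 
an existing config.

{code:scala}
// Sketch only: "run_compact" is the proposed operation value, not a finalized API.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val basePath  = "hdfs:///tmp/hudi_compaction_demo"
val tableName = "hudi_compaction_demo"

// The input DF is ignored by the operation; a dummy one-row frame is enough.
val dummyDf = Seq("noop").toDF("placeholder")

dummyDf.write.format("hudi").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.operation", "run_compact").      // proposed operation value
  // option("hoodie.compaction.instant.time", "20201203101530").   // hypothetical: pin a specific instant
  mode(SaveMode.Append).
  save(basePath)
{code}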

> Support async compaction in spark DF write()
> 
>
> Key: HUDI-1329
> URL: https://issues.apache.org/jira/browse/HUDI-1329
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> spark.write().format("hudi").option(operation, "run_compact") to run 
> compaction
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1413) Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync

2020-11-23 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1413:
-
Fix Version/s: 0.7.0

> Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync
> --
>
> Key: HUDI-1413
> URL: https://issues.apache.org/jira/browse/HUDI-1413
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> GH issue : https://github.com/apache/hudi/issues/2270



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1413) Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync

2020-11-23 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1413:


 Summary: Need binary release of Hudi to distribute tools like 
hudi-cli.sh and hudi-sync
 Key: HUDI-1413
 URL: https://issues.apache.org/jira/browse/HUDI-1413
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Usability
Reporter: Balaji Varadarajan


GH issue : https://github.com/apache/hudi/issues/2270



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1413) Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync

2020-11-23 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1413:
-
Status: Open  (was: New)

> Need binary release of Hudi to distribute tools like hudi-cli.sh and hudi-sync
> --
>
> Key: HUDI-1413
> URL: https://issues.apache.org/jira/browse/HUDI-1413
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Balaji Varadarajan
>Priority: Major
>
> GH issue : https://github.com/apache/hudi/issues/2270



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1395) HoodieSnapshotCopier not working on non-partitioned datasets

2020-11-12 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1395:


 Summary: HoodieSnapshotCopier not working on non-partitioned 
datasets
 Key: HUDI-1395
 URL: https://issues.apache.org/jira/browse/HUDI-1395
 Project: Apache Hudi
  Issue Type: Bug
  Components: Utilities
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


https://github.com/apache/hudi/issues/2244



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1395) HoodieSnapshotCopier not working on non-partitioned datasets

2020-11-12 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1395:
-
Status: Open  (was: New)

> HoodieSnapshotCopier not working on non-partitioned datasets
> 
>
> Key: HUDI-1395
> URL: https://issues.apache.org/jira/browse/HUDI-1395
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Utilities
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> https://github.com/apache/hudi/issues/2244



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1205) Serialization fail when log file is larger than 2GB

2020-11-11 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230386#comment-17230386
 ] 

Balaji Varadarajan commented on HUDI-1205:
--

[~leehuynh] [~zuyanton] [~garyli1019] Please see the above comment and try 
master. 
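
As a quick arithmetic sketch of the 2 GB ceiling described in the report below (not 
Hudi code, just the size math):

{code:scala}
// A 32-bit size field tops out near 2 GiB, so larger log file groups overflow it.
val maxIntBytes: Long   = Int.MaxValue.toLong        // 2147483647 bytes ~= 2 GiB
val logGroupBytes: Long = 3L * 1024 * 1024 * 1024    // e.g. a 3 GiB log file group
println(logGroupBytes > maxIntBytes)                 // true: size no longer fits in an Int
{code}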

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the size in 
> bytes for the log file. The maximum size an Integer can represent is about 2GB.
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
> data (org.apache.hudi.common.model.HoodieRecord)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
> at 
> org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
> at 
> org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
> at 
> org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
> ... 31 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1205) Serialization fail when log file is larger than 2GB

2020-11-11 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230385#comment-17230385
 ] 

Balaji Varadarajan commented on HUDI-1205:
--

This is likely fixed as part of 
[https://github.com/apache/hudi/commit/b335459c805748815ccc858ff1a9ef4cd830da8c]

Ref: [https://github.com/apache/hudi/issues/2237]

Will close once it is confirmed. 

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the size in 
> bytes for the log file. The maximum size an Integer can represent is about 2GB.
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
> data (org.apache.hudi.common.model.HoodieRecord)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
> at 
> org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
> at 
> org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
> at 
> org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
> ... 31 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1383) Incorrect partitions getting hive synced

2020-11-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1383:
-
Status: Open  (was: New)

> Incorrect partitions getting hive synced
> 
>
> Key: HUDI-1383
> URL: https://issues.apache.org/jira/browse/HUDI-1383
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> https://github.com/apache/hudi/issues/2234



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1383) Incorrect partitions getting hive synced

2020-11-09 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1383:


 Summary: Incorrect partitions getting hive synced
 Key: HUDI-1383
 URL: https://issues.apache.org/jira/browse/HUDI-1383
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


https://github.com/apache/hudi/issues/2234



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1381) Schedule compaction based on time elapsed

2020-11-09 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1381:


 Summary: Schedule compaction based on time elapsed 
 Key: HUDI-1381
 URL: https://issues.apache.org/jira/browse/HUDI-1381
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Compaction
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


GH : [https://github.com/apache/hudi/issues/2229]

It would be helpful to introduce a configuration to schedule compaction based on 
the time elapsed since the last scheduled compaction.
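
A minimal sketch of the kind of check such a configuration could drive, assuming 
Hudi's yyyyMMddHHmmss instant-time format; the threshold is a hypothetical config 
value, not an existing Hudi option.

{code:scala}
// Sketch only: illustrates a time-elapsed trigger for scheduling compaction.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

val instantFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")  // assumed instant format

def shouldScheduleCompaction(lastCompactionInstant: String,
                             now: LocalDateTime,
                             maxElapsedMinutes: Long): Boolean = {
  val last = LocalDateTime.parse(lastCompactionInstant, instantFormat)
  ChronoUnit.MINUTES.between(last, now) >= maxElapsedMinutes
}

// e.g. schedule if more than 60 minutes have passed since the last scheduled compaction
shouldScheduleCompaction("20201109120000", LocalDateTime.now(), 60)
{code}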



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1381) Schedule compaction based on time elapsed

2020-11-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1381:
-
Status: Open  (was: New)

> Schedule compaction based on time elapsed 
> --
>
> Key: HUDI-1381
> URL: https://issues.apache.org/jira/browse/HUDI-1381
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Compaction
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> GH : [https://github.com/apache/hudi/issues/2229]
> It would be helpful to introduce a configuration to schedule compaction based 
> on the time elapsed since the last scheduled compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2020-11-05 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1309:
-
Status: Open  (was: New)

> Listing Metadata unreadable in S3 as the log block is deemed corrupted
> --
>
> Key: HUDI-1309
> URL: https://issues.apache.org/jira/browse/HUDI-1309
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> When running the metadata list-partitions CLI command, I am seeing the below 
> messages and the partition list is empty. I was expecting 10K partitions.
>  
> {code:java}
>  36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
>  36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
>  36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
>  44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
>  44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
>  44567 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1365) Listing leaf files and directories is very Slow

2020-11-03 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225477#comment-17225477
 ] 

Balaji Varadarajan commented on HUDI-1365:
--

https://github.com/apache/hudi/commit/9a1f698eef103adadbf7a1bf7b5eb94fb84e

> Listing leaf files and directories is very Slow
> ---
>
> Key: HUDI-1365
> URL: https://issues.apache.org/jira/browse/HUDI-1365
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Selvaraj periyasamy
>Priority: Major
> Attachments: image-2020-11-01-01-11-11-561.png, image.png
>
>
> I am using Hudi 0.5.0. I took 0.5.0 and used the changes for 
> HoodieROTablePathFilter from HUDI-1144. Even though it caches, I am seeing 
> only 46 directories cached in 1 min. Due to this, my job takes a lot of time to 
> write, because I have 6 months' worth of hourly partitions. Is there a way to 
> speed up? I am running it in a production cluster and have enough vcores 
> available to process.
>  
> HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString());
>  if (null == metaClient)
> { metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), 
> true); metaClientCache.put(baseDir.toString(), metaClient); }
> HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
>  
> metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(),
>  fs.listStatus(folder));
>  List latestFiles = 
> fsView.getLatestDataFiles().collect(Collectors.toList());
>  // populate the cache
>  if (!hoodiePathCache.containsKey(folder.toString()))
> { hoodiePathCache.put(folder.toString(), new HashSet<>()); }
> LOG.info("Custom Code : Based on hoodie metadata from base path: " + 
> baseDir.toString() + ", caching " + latestFiles.size()
>  + " files under " + folder);
>  for (HoodieDataFile lfile : latestFiles)
> { hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath())); }
>  
>  
>  
> Sample Logs here. I have attached the log file as well.
>  
> 20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/08, #FileGroups=2
>  20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files 
> under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08
>  20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at  
> threw an IOException!! java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)
>  20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/09, #FileGroups=2
>  20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files 
> under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09
>  20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/10, #FileGroups=3
>  20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files 
> under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10
>  20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at  
> threw an IOException!! java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)
>  20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at  
> threw an IOException!! java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)
>  20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/11, #FileGroups=2
>  20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> 

[jira] [Commented] (HUDI-1365) Listing leaf files and directories is very Slow

2020-11-02 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224750#comment-17224750
 ] 

Balaji Varadarajan commented on HUDI-1365:
--

[~Selvaraj.periyasamy1983]: 0.5.0 is a very old version of Hudi. You should try 
moving to a later version, as there are other improvements such as the removal of 
"rename" operations. W.r.t. your performance, I see a lot of WARN-level logs with 
exceptions getting caught. I am wondering if this is due to misconfiguration 
and is slowing your query. 

On a different note, we are going to support a feature in the next release which 
would avoid listing data partitions completely 
(https://issues.apache.org/jira/browse/HUDI-1292).  
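
For reference, a small sketch of how that file-listing metadata feature is expected 
to be switched on from the writer side once it lands; the "hoodie.metadata.enable" 
key is an assumption here, so please verify the final config name against the 
release notes.

{code:scala}
// Sketch: enabling metadata-based file listing on the write path
// ("hoodie.metadata.enable" is assumed; verify against the release you use).
import spark.implicits._

val df = Seq((1, "a", "20200919/08")).toDF("id", "value", "partition")

df.write.format("hudi").
  option("hoodie.table.name", "trr").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "value").
  option("hoodie.datasource.write.partitionpath.field", "partition").
  option("hoodie.metadata.enable", "true").   // maintain the internal file-listing metadata table
  mode("append").
  save("hdfs:///tmp/trr_metadata_demo")
{code}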

> Listing leaf files and directories is very Slow
> ---
>
> Key: HUDI-1365
> URL: https://issues.apache.org/jira/browse/HUDI-1365
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Selvaraj periyasamy
>Priority: Major
> Attachments: Log.txt, image-2020-11-01-01-11-11-561.png
>
>
> I am using Hudi 0.5.0. I took 0.5.0 and used the changes for 
> HoodieROTablePathFilter from HUDI-1144. Even though it caches, I am seeing 
> only 46 directories cached in 1 min. Due to this, my job takes a lot of time to 
> write, because I have 6 months' worth of hourly partitions. Is there a way to 
> speed up? I am running it in a production cluster and have enough vcores 
> available to process.
>  
> HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString());
>  if (null == metaClient)
> { metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), 
> true); metaClientCache.put(baseDir.toString(), metaClient); }
> HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
>  
> metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(),
>  fs.listStatus(folder));
>  List latestFiles = 
> fsView.getLatestDataFiles().collect(Collectors.toList());
>  // populate the cache
>  if (!hoodiePathCache.containsKey(folder.toString()))
> { hoodiePathCache.put(folder.toString(), new HashSet<>()); }
> LOG.info("Custom Code : Based on hoodie metadata from base path: " + 
> baseDir.toString() + ", caching " + latestFiles.size()
>  + " files under " + folder);
>  for (HoodieDataFile lfile : latestFiles)
> { hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath())); }
>  
>  
>  
> Sample Logs here. I have attached the log file as well.
>  
> 20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/08, #FileGroups=2
>  20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files 
> under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08
>  20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at 
> [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! 
> java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)
>  20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/09, #FileGroups=2
>  20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files 
> under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09
>  20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
> partition :20200919/10, #FileGroups=3
>  20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
> NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
>  20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
> hoodie metadata from base path: 
> hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files 
> under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10
>  20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at 
> [http://sl73caehmpc1009.visa.com:9292/kms/v1/] threw an IOException!! 
> java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)
>  20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at 
> [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! 
> 

[jira] [Created] (HUDI-1368) Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2

2020-11-02 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1368:


 Summary: Merge On Read Snapshot Reader not working for Databricks 
on ADLS Gen2
 Key: HUDI-1368
 URL: https://issues.apache.org/jira/browse/HUDI-1368
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Spark Integration
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


https://github.com/apache/hudi/issues/2180



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1368) Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2

2020-11-02 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1368:
-
Status: Open  (was: New)

> Merge On Read Snapshot Reader not working for Databricks on ADLS Gen2
> -
>
> Key: HUDI-1368
> URL: https://issues.apache.org/jira/browse/HUDI-1368
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: adls
> Fix For: 0.7.0
>
>
> https://github.com/apache/hudi/issues/2180



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys

2020-10-30 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1363:
-
Status: Open  (was: New)

> Provide Option to drop columns after they are used to generate partition or 
> record keys
> ---
>
> Key: HUDI-1363
> URL: https://issues.apache.org/jira/browse/HUDI-1363
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> Context: https://github.com/apache/hudi/issues/2213



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys

2020-10-30 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1363:


 Summary: Provide Option to drop columns after they are used to 
generate partition or record keys
 Key: HUDI-1363
 URL: https://issues.apache.org/jira/browse/HUDI-1363
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


Context: https://github.com/apache/hudi/issues/2213



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1358) Memory Leak in HoodieLogFormatWriter

2020-10-29 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1358:


Assignee: Balaji Varadarajan

> Memory Leak in HoodieLogFormatWriter
> 
>
> Key: HUDI-1358
> URL: https://issues.apache.org/jira/browse/HUDI-1358
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> https://github.com/apache/hudi/issues/2215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1358) Memory Leak in HoodieLogFormatWriter

2020-10-29 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1358:


 Summary: Memory Leak in HoodieLogFormatWriter
 Key: HUDI-1358
 URL: https://issues.apache.org/jira/browse/HUDI-1358
 Project: Apache Hudi
  Issue Type: Bug
  Components: Writer Core
Reporter: Balaji Varadarajan


https://github.com/apache/hudi/issues/2215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1358) Memory Leak in HoodieLogFormatWriter

2020-10-29 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1358:
-
Status: Open  (was: New)

> Memory Leak in HoodieLogFormatWriter
> 
>
> Key: HUDI-1358
> URL: https://issues.apache.org/jira/browse/HUDI-1358
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> https://github.com/apache/hudi/issues/2215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1350) Support Partition level delete API in HUDI on top on Insert Overwrite

2020-10-23 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219950#comment-17219950
 ] 

Balaji Varadarajan commented on HUDI-1350:
--

Yes, [~309637554]: You can change the API to take in a list of partitions. 

 

At the Spark datasource and CLI level, you can accept a path glob for the 
partitions to be deleted.
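
As a rough sketch of what that could look like at the Spark datasource level; the 
operation value and the partitions option below are illustrative names only, 
mirroring the suggestion above, not an existing API.

{code:scala}
// Hypothetical sketch of a partition-level delete built on top of insert overwrite.
import spark.implicits._

val emptyDf = Seq.empty[(String, String)].toDF("id", "partition")   // no new records needed

emptyDf.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.operation", "delete_partition").                 // illustrative value
  option("hoodie.datasource.write.partitions.to.delete", "2020/10/*,2020/09/30").  // list or glob
  mode("append").
  save("hdfs:///tmp/my_table")
{code}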

 

> Support Partition level delete API in HUDI on top on Insert Overwrite
> -
>
> Key: HUDI-1350
> URL: https://issues.apache.org/jira/browse/HUDI-1350
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1340) Not able to query real time table when rows contains nested elements

2020-10-22 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219370#comment-17219370
 ] 

Balaji Varadarajan commented on HUDI-1340:
--

This is likely related to a parquet (serde and related library) version 
difference between the write and read (parquet-hive) sides.

> Not able to query real time table when rows contains nested elements
> 
>
> Key: HUDI-1340
> URL: https://issues.apache.org/jira/browse/HUDI-1340
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Bharat Dighe
>Priority: Major
> Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, 
> users3.avro, users4.avro, users5.avro
>
>
> AVRO schema: Attached
> Script to generate sample data: attached
> Sample data attached
> ==
> the schema has nested elements; here is the output from hive
> {code:java}
>   CREATE EXTERNAL TABLE `users_mor_rt`( 
>  `_hoodie_commit_time` string, 
>  `_hoodie_commit_seqno` string, 
>  `_hoodie_record_key` string, 
>  `_hoodie_partition_path` string, 
>  `_hoodie_file_name` string, 
>  `name` string, 
>  `userid` int, 
>  `datehired` string, 
>  `meta` struct, 
>  `experience` 
> struct>>) 
>  PARTITIONED BY ( 
>  `role` string) 
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  LOCATION 
>  'hdfs://namenode:8020/tmp/hudi_repair_order_mor' 
>  TBLPROPERTIES ( 
>  'last_commit_time_sync'='20201011190954', 
>  'transient_lastDdlTime'='1602442906')
> {code}
> scala  code:
> {code:java}
> import java.io.File
> import org.apache.hudi.QuickstartUtils._
> import org.apache.spark.sql.SaveMode._
> import org.apache.avro.Schema
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "users_mor"
> //  val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> //  Insert Data
> /// local not hdfs !!!
> //val schema = new Schema.Parser().parse(new 
> File("/var/hoodie/ws/docker/demo/data/user/user.avsc"))
> def updateHudi( num:String, op:String) = {
> val path = "hdfs:///var/demo/data/user/users" + num + ".avro"
> println( path );
> val avdf2 =  new org.apache.spark.sql.SQLContext(sc).read.format("avro").
> // option("avroSchema", schema.toString).
> load(path)
> avdf2.select("name").show(false)
> avdf2.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(OPERATION_OPT_KEY,op).
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // 
> default:COPY_ON_WRITE, MERGE_ON_READ
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.ComplexKeyGenerator").
> option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime").   // dedup
> option(RECORDKEY_FIELD_OPT_KEY, "userId").   // key
> option(PARTITIONPATH_FIELD_OPT_KEY, "role").
> option(TABLE_NAME, tableName).
> option("hoodie.compact.inline", false).
> option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
> option(HIVE_TABLE_OPT_KEY, tableName).
> option(HIVE_USER_OPT_KEY, "hive").
> option(HIVE_PASS_OPT_KEY, "hive").
> option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1").
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "role").
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").
> option("hoodie.datasource.hive_sync.assume_date_partitioning", 
> "false").
> mode(Append).
> save(basePath)
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, experience.companies[0] from " + tableName + 
> "_rt").show()
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + 
> "_ro").show()
> }
> updateHudi("1", "bulkinsert")
> updateHudi("2", "upsert")
> updateHudi("3", "upsert")
> updateHudi("4", "upsert")
> {code}
> If nested fields are not included, it works fine
> {code}
> scala> spark.sql("select name from users_mor_rt");
> res19: org.apache.spark.sql.DataFrame = [name: string]
> scala> spark.sql("select name from users_mor_rt").show();
> +-+
> | name|
> +-+
> |engg3|
> |engg1_new|
> |engg2_new|
> | mgr1|
> | mgr2|
> |  devops1|
> |  devops2|
> +-+
> {code}
> But fails when I include nested field 'experience'
> {code}
> scala> spark.sql("select 

[jira] [Commented] (HUDI-1340) Not able to query real time table when rows contains nested elements

2020-10-19 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216913#comment-17216913
 ] 

Balaji Varadarajan commented on HUDI-1340:
--

[~bdighe]: Did you use --conf spark.sql.hive.convertMetastoreParquet=false when 
you started the spark-shell where you are running the query?

 

https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whydowehavetoset2differentwaysofconfiguringSparktoworkwithHudi?
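
For reference, a minimal way to start the shell with that flag and re-run the 
nested-field query from this report (the bundle versions in the comment are only 
an example):

{code:scala}
// Start the shell with Hive's parquet conversion disabled so Hudi's realtime input format is used:
//   spark-shell --conf spark.sql.hive.convertMetastoreParquet=false \
//               --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4
// Then re-run the query that was failing:
spark.sql("select name, experience from users_mor_rt").show(false)
{code}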

> Not able to query real time table when rows contains nested elements
> 
>
> Key: HUDI-1340
> URL: https://issues.apache.org/jira/browse/HUDI-1340
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Bharat Dighe
>Priority: Major
> Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, 
> users3.avro, users4.avro, users5.avro
>
>
> AVRO schema: Attached
> Script to generate sample data: attached
> Sample data attached
> ==
> the schema has nested elements; here is the output from hive
> {code:java}
>   CREATE EXTERNAL TABLE `users_mor_rt`( 
>  `_hoodie_commit_time` string, 
>  `_hoodie_commit_seqno` string, 
>  `_hoodie_record_key` string, 
>  `_hoodie_partition_path` string, 
>  `_hoodie_file_name` string, 
>  `name` string, 
>  `userid` int, 
>  `datehired` string, 
>  `meta` struct, 
>  `experience` 
> struct>>) 
>  PARTITIONED BY ( 
>  `role` string) 
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  LOCATION 
>  'hdfs://namenode:8020/tmp/hudi_repair_order_mor' 
>  TBLPROPERTIES ( 
>  'last_commit_time_sync'='20201011190954', 
>  'transient_lastDdlTime'='1602442906')
> {code}
> scala  code:
> {code:java}
> import java.io.File
> import org.apache.hudi.QuickstartUtils._
> import org.apache.spark.sql.SaveMode._
> import org.apache.avro.Schema
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "users_mor"
> //  val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> //  Insert Data
> /// local not hdfs !!!
> //val schema = new Schema.Parser().parse(new 
> File("/var/hoodie/ws/docker/demo/data/user/user.avsc"))
> def updateHudi( num:String, op:String) = {
> val path = "hdfs:///var/demo/data/user/users" + num + ".avro"
> println( path );
> val avdf2 =  new org.apache.spark.sql.SQLContext(sc).read.format("avro").
> // option("avroSchema", schema.toString).
> load(path)
> avdf2.select("name").show(false)
> avdf2.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(OPERATION_OPT_KEY,op).
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // 
> default:COPY_ON_WRITE, MERGE_ON_READ
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.ComplexKeyGenerator").
> option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime").   // dedup
> option(RECORDKEY_FIELD_OPT_KEY, "userId").   // key
> option(PARTITIONPATH_FIELD_OPT_KEY, "role").
> option(TABLE_NAME, tableName).
> option("hoodie.compact.inline", false).
> option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
> option(HIVE_TABLE_OPT_KEY, tableName).
> option(HIVE_USER_OPT_KEY, "hive").
> option(HIVE_PASS_OPT_KEY, "hive").
> option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1").
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "role").
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").
> option("hoodie.datasource.hive_sync.assume_date_partitioning", 
> "false").
> mode(Append).
> save(basePath)
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, experience.companies[0] from " + tableName + 
> "_rt").show()
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + 
> "_ro").show()
> }
> updateHudi("1", "bulkinsert")
> updateHudi("2", "upsert")
> updateHudi("3", "upsert")
> updateHudi("4", "upsert")
> {code}
> If nested fields are not included, it works fine
> {code}
> scala> spark.sql("select name from users_mor_rt");
> res19: org.apache.spark.sql.DataFrame = [name: string]
> scala> spark.sql("select name from users_mor_rt").show();
> +-+
> | name|
> +-+
> |engg3|
> |engg1_new|
> |engg2_new|
> | mgr1|
> | mgr2|
> |  

[jira] [Updated] (HUDI-1340) Not able to query real time table when rows contains nested elements

2020-10-19 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1340:
-
Status: Open  (was: New)

> Not able to query real time table when rows contains nested elements
> 
>
> Key: HUDI-1340
> URL: https://issues.apache.org/jira/browse/HUDI-1340
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Bharat Dighe
>Priority: Major
> Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, 
> users3.avro, users4.avro, users5.avro
>
>
> AVRO schema: Attached
> Script to generate sample data: attached
> Sample data attached
> ==
> the schema has nested elements; here is the output from hive
> {code:java}
>   CREATE EXTERNAL TABLE `users_mor_rt`( 
>  `_hoodie_commit_time` string, 
>  `_hoodie_commit_seqno` string, 
>  `_hoodie_record_key` string, 
>  `_hoodie_partition_path` string, 
>  `_hoodie_file_name` string, 
>  `name` string, 
>  `userid` int, 
>  `datehired` string, 
>  `meta` struct, 
>  `experience` 
> struct>>) 
>  PARTITIONED BY ( 
>  `role` string) 
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  LOCATION 
>  'hdfs://namenode:8020/tmp/hudi_repair_order_mor' 
>  TBLPROPERTIES ( 
>  'last_commit_time_sync'='20201011190954', 
>  'transient_lastDdlTime'='1602442906')
> {code}
> scala  code:
> {code:java}
> import java.io.File
> import org.apache.hudi.QuickstartUtils._
> import org.apache.spark.sql.SaveMode._
> import org.apache.avro.Schema
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "users_mor"
> //  val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> //  Insert Data
> /// local not hdfs !!!
> //val schema = new Schema.Parser().parse(new 
> File("/var/hoodie/ws/docker/demo/data/user/user.avsc"))
> def updateHudi( num:String, op:String) = {
> val path = "hdfs:///var/demo/data/user/users" + num + ".avro"
> println( path );
> val avdf2 =  new org.apache.spark.sql.SQLContext(sc).read.format("avro").
> // option("avroSchema", schema.toString).
> load(path)
> avdf2.select("name").show(false)
> avdf2.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(OPERATION_OPT_KEY,op).
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // 
> default:COPY_ON_WRITE, MERGE_ON_READ
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.ComplexKeyGenerator").
> option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime").   // dedup
> option(RECORDKEY_FIELD_OPT_KEY, "userId").   // key
> option(PARTITIONPATH_FIELD_OPT_KEY, "role").
> option(TABLE_NAME, tableName).
> option("hoodie.compact.inline", false).
> option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
> option(HIVE_TABLE_OPT_KEY, tableName).
> option(HIVE_USER_OPT_KEY, "hive").
> option(HIVE_PASS_OPT_KEY, "hive").
> option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1").
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "role").
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").
> option("hoodie.datasource.hive_sync.assume_date_partitioning", 
> "false").
> mode(Append).
> save(basePath)
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, experience.companies[0] from " + tableName + 
> "_rt").show()
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + 
> "_ro").show()
> }
> updateHudi("1", "bulkinsert")
> updateHudi("2", "upsert")
> updateHudi("3", "upsert")
> updateHudi("4", "upsert")
> {code}
> If nested fields are not included, it works fine
> {code}
> scala> spark.sql("select name from users_mor_rt");
> res19: org.apache.spark.sql.DataFrame = [name: string]
> scala> spark.sql("select name from users_mor_rt").show();
> +-+
> | name|
> +-+
> |engg3|
> |engg1_new|
> |engg2_new|
> | mgr1|
> | mgr2|
> |  devops1|
> |  devops2|
> +-+
> {code}
> But fails when I include nested field 'experience'
> {code}
> scala> spark.sql("select name, experience from users_mor_rt").show();
> 20/10/11 19:53:58 ERROR executor.Executor: Exception in task 0.0 in stage 
> 147.0 (TID 153)
> 

[jira] [Commented] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner

2020-10-16 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215226#comment-17215226
 ] 

Balaji Varadarajan commented on HUDI-845:
-

Yes, [~309637554], this ticket is for tracking general concurrent writes. 
Supporting partition-level concurrency could be the first phase of the 
implementation, so we might have to do that first. 

> Allow parallel writing and move the pending rollback work into cleaner
> --
>
> Key: HUDI-845
> URL: https://issues.apache.org/jira/browse/HUDI-845
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.7.0
>
>
> Things to think about 
>  * Commit time has to be unique across writers 
>  * Parallel writers can finish commits out of order i.e c2 commits before c1.
>  * MOR log blocks fence uncommited data.. 
>  * Cleaner should loudly complain if it cannot finish cleaning up partial 
> writes.  
>  
> P.S: think about what is left for the general thing : log files may have 
> different order, inserts may violate uniqueness constraint



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion

2020-10-13 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1343:
-
Fix Version/s: 0.7.0

> Add standard schema postprocessor which would rewrite the schema using 
> spark-avro conversion
> 
>
> Key: HUDI-1343
> URL: https://issues.apache.org/jira/browse/HUDI-1343
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> When we use Transformer, the final Schema which we use to convert avro record 
> to bytes is auto generated by spark. This could be different (due to the way 
> Avro treats it) from the target schema that is being used to write (as the 
> target schema could be coming from Schema Registry). 
>  
> For example : 
> Schema generated by spark-avro when converting Row to avro
> {
>   "type" : "record",
>   "name" : "hoodie_source",
>   "namespace" : "hoodie.source",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "long", "null" ]
>   }, {
>     "name" : "_op",
>     "type" : "string"
>   }, {
>     "name" : "inc_id",
>     "type" : "int"
>   }, {
>     "name" : "year",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "violation_code",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "flag",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long"
>   } ]
> }
>  
> is not compatible with the Avro Schema:
>  
> {
>   "type" : "record",
>   "name" : "formatted_debezium_payload",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "null", "long" ],
>     "default" : null
>   }, {
>     "name" : "_op",
>     "type" : "string",
>     "default" : null
>   }, {
>     "name" : "inc_id",
>     "type" : "int",
>     "default" : null
>   }, {
>     "name" : "year",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "violation_code",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "flag",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long",
>     "default" : null
>   } ]
> }
>  
> Note that the type order is different for individual fields : 
> "type" : [ "null", "string" ], vs  "type" : [ "string", "null" ]
> Unexpectedly, Avro decoding fails when bytes written with the first schema are 
> read using the second schema.
>  
> One way to fix this is to use the configured target schema when generating 
> record bytes, but this is not easy without breaking the Record payload 
> constructor API used by DeltaStreamer. 
> The other option is to apply a post-processor on the target schema to make its 
> schema consistent with Transformer-generated records.
>  
> This ticket takes the latter approach of creating a standard schema 
> post-processor and adding it by default when Transformer is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion

2020-10-13 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1343:


 Summary: Add standard schema postprocessor which would rewrite the 
schema using spark-avro conversion
 Key: HUDI-1343
 URL: https://issues.apache.org/jira/browse/HUDI-1343
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Balaji Varadarajan


When we use Transformer, the final schema which we use to convert the Avro record 
to bytes is auto-generated by Spark. This could be different (due to the way 
Avro treats it) from the target schema that is being used to write (as the 
target schema could be coming from the Schema Registry). 

 

For example : 

Schema generated by spark-avro when converting Row to avro

{
  "type" : "record",
  "name" : "hoodie_source",
  "namespace" : "hoodie.source",
  "fields" : [ {
    "name" : "_ts_ms",
    "type" : [ "long", "null" ]
  }, {
    "name" : "_op",
    "type" : "string"
  }, {
    "name" : "inc_id",
    "type" : "int"
  }, {
    "name" : "year",
    "type" : [ "int", "null" ]
  }, {
    "name" : "violation_desc",
    "type" : [ "string", "null" ]
  }, {
    "name" : "violation_code",
    "type" : [ "string", "null" ]
  }, {
    "name" : "case_individual_id",
    "type" : [ "int", "null" ]
  }, {
    "name" : "flag",
    "type" : [ "string", "null" ]
  }, {
    "name" : "last_modified_ts",
    "type" : "long"
  } ]
}

 

is not compatible with the Avro Schema:

 

{
  "type" : "record",
  "name" : "formatted_debezium_payload",
  "fields" : [ {
    "name" : "_ts_ms",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "_op",
    "type" : "string",
    "default" : null
  }, {
    "name" : "inc_id",
    "type" : "int",
    "default" : null
  }, {
    "name" : "year",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "violation_desc",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "violation_code",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "case_individual_id",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "flag",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "last_modified_ts",
    "type" : "long",
    "default" : null
  } ]
}

 

Note that the type order is different for individual fields : 

"type" : [ "null", "string" ], vs  "type" : [ "string", "null" ]

Unexpectedly, Avro decoding fails when bytes written with first schema is read 
using second schema.

 

One way to fix is to use configured target schema when generating record bytes 
but this is not easy without breaking Record payload constructor API used by 
deltastreamer. 

The other option is to apply a post-processor on target schema to make it 
schema consistent with Transformer generated records.

 

This ticket is to use the later approach of creating a standard schema 
post-processor and adding it by default when Transformer is used.
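
For illustration, a minimal sketch of what such a post-processor could look 
like, assuming a hook that exposes a processSchema(Schema) method and using 
spark-avro's SchemaConverters to do the rewrite (the class name and hook shape 
below are assumptions, not the final implementation):

```
// Sketch only: round-trip the target schema through spark-avro so that nullable
// fields end up with the same union member order that the Transformer's
// Row-to-Avro conversion produces.
import org.apache.avro.Schema;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.DataType;

public class SparkAvroNormalizingPostProcessor {

  public Schema processSchema(Schema targetSchema) {
    // Avro -> Catalyst StructType: union ordering is dropped here, nullability is kept.
    DataType sparkType = SchemaConverters.toSqlType(targetSchema).dataType();
    // Catalyst -> Avro: regenerated the same way spark-avro does it for Transformer
    // output, so nullable fields become [ "<type>", "null" ] unions consistently.
    return SchemaConverters.toAvroType(sparkType, false, targetSchema.getName(),
        targetSchema.getNamespace() == null ? "" : targetSchema.getNamespace());
  }
}
```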



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion

2020-10-13 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1343:
-
Status: Open  (was: New)

> Add standard schema postprocessor which would rewrite the schema using 
> spark-avro conversion
> 
>
> Key: HUDI-1343
> URL: https://issues.apache.org/jira/browse/HUDI-1343
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Priority: Major
>
> When we use Transformer, the final schema which we use to convert Avro records 
> to bytes is auto-generated by Spark. This could be different (due to the way 
> Avro handles union ordering) from the target schema that is being used to write 
> (as the target schema could be coming from the Schema Registry). 
>  
> For example : 
> Schema generated by spark-avro when converting Row to avro
> {
>   "type" : "record",
>   "name" : "hoodie_source",
>   "namespace" : "hoodie.source",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "long", "null" ]
>   }, {
>     "name" : "_op",
>     "type" : "string"
>   }, {
>     "name" : "inc_id",
>     "type" : "int"
>   }, {
>     "name" : "year",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "violation_code",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "flag",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long"
>   } ]
> }
>  
> is not compatible with the Avro Schema:
>  
> {
>   "type" : "record",
>   "name" : "formatted_debezium_payload",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "null", "long" ],
>     "default" : null
>   }, {
>     "name" : "_op",
>     "type" : "string",
>     "default" : null
>   }, {
>     "name" : "inc_id",
>     "type" : "int",
>     "default" : null
>   }, {
>     "name" : "year",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "violation_code",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "flag",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long",
>     "default" : null
>   } ]
> }
>  
> Note that the type order is different for individual fields : 
> "type" : [ "null", "string" ], vs  "type" : [ "string", "null" ]
> Unexpectedly, Avro decoding fails when bytes written with the first schema 
> are read using the second schema.
>  
> One way to fix this is to use the configured target schema when generating 
> record bytes, but this is not easy without breaking the record payload 
> constructor API used by DeltaStreamer. 
> The other option is to apply a post-processor on the target schema to make it 
> consistent with Transformer-generated records.
>  
> This ticket is to use the latter approach of creating a standard schema 
> post-processor and adding it by default when Transformer is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1329) Support async compaction in spark DF write()

2020-10-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1329:
-
Status: Open  (was: New)

> Support async compaction in spark DF write()
> 
>
> Key: HUDI-1329
> URL: https://issues.apache.org/jira/browse/HUDI-1329
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.7.0
>
>
> spark.write().format("hudi").option(operation, "run_compact") to run 
> compaction
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1329) Support async compaction in spark DF write()

2020-10-09 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1329:


 Summary: Support async compaction in spark DF write()
 Key: HUDI-1329
 URL: https://issues.apache.org/jira/browse/HUDI-1329
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Compaction
Reporter: Balaji Varadarajan
 Fix For: 0.7.0


spark.write().format("hudi").option(operation, "run_compact") to run compaction
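
A sketch of how the proposed invocation could look from the Java DataFrame API 
("run_compact" is the value proposed by this ticket, not an existing option; 
the table name and base path are placeholders):

```
// Sketch of the proposed usage: an Append-mode write whose only purpose is to
// trigger compaction on an existing MOR table.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RunCompactionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-run-compact").getOrCreate();

    // No new records are needed; the write call only carries the operation.
    Dataset<Row> empty = spark.emptyDataFrame();

    empty.write()
        .format("hudi")
        .option("hoodie.table.name", "my_table")                     // placeholder
        .option("hoodie.datasource.write.operation", "run_compact")  // proposed by this ticket
        .mode(SaveMode.Append)
        .save("s3a://bucket/path/to/my_table");                      // placeholder base path
  }
}
```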

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-898) Need to add Schema parameter to HoodieRecordPayload::preCombine

2020-10-02 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-898:

Status: Open  (was: New)

> Need to add Schema parameter to HoodieRecordPayload::preCombine
> ---
>
> Key: HUDI-898
> URL: https://issues.apache.org/jira/browse/HUDI-898
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Yixue Zhu
>Priority: Major
>
> We are working on Mongo Oplog integration with Hudi, to stream Mongo updates 
> to Hudi tables.
> There are 4 Mongo OpLog operations we need to handle, CRUD (create, read, 
> update, delete).
> Currently Hudi handles create/read and delete well, but not update, with the 
> existing preCombine API in the HoodieRecordPayload class. In particular, the 
> Update operation contains a "patch" field, which is extended JSON describing 
> updates for dot-separated field paths.
> We need to pass the Avro schema to the preCombine API for this to work:
> even though the BaseAvroPayload constructor accepts a GenericRecord, which has 
> an Avro schema reference, it materializes the GenericRecord to bytes to 
> support serialization/deserialization by ExternalSpillableMap.
>  
> Is there any concern/objection to this? In other words, have I overlooked 
> something?
>  
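
A minimal sketch of the schema-aware preCombine overload being asked for here; 
the interface mirrors HoodieRecordPayload's preCombine, but the overload itself 
is the proposal, not existing API:

```
// Sketch only: a schema-aware preCombine so that partial-update payloads (e.g. a
// Mongo oplog "patch" payload) can decode both records and merge dot-separated
// field paths instead of blindly picking one side.
import java.util.Properties;
import org.apache.avro.Schema;

public interface SchemaAwarePayload<T extends SchemaAwarePayload<T>> {

  // Existing-style preCombine: without a schema, a payload that stores raw bytes
  // cannot interpret the other record, so field-level merging is impossible.
  T preCombine(T oldValue);

  // Proposed overload: the writer schema lets the payload deserialize its bytes and
  // apply the oplog patch before choosing or merging a winner.
  default T preCombine(T oldValue, Schema schema, Properties props) {
    return preCombine(oldValue); // default keeps today's schema-less behaviour
  }
}
```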



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-898) Need to add Schema parameter to HoodieRecordPayload::preCombine

2020-10-02 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-898:
---

Assignee: Balaji Varadarajan

> Need to add Schema parameter to HoodieRecordPayload::preCombine
> ---
>
> Key: HUDI-898
> URL: https://issues.apache.org/jira/browse/HUDI-898
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Yixue Zhu
>Assignee: Balaji Varadarajan
>Priority: Major
>
> We are working on Mongo Oplog integration with Hudi, to stream Mongo updates 
> to Hudi tables.
> There are 4 Mongo OpLog operations we need to handle, CRUD (create, read, 
> update, delete).
> Currently Hudi handles create/read and delete well, but not update, with the 
> existing preCombine API in the HoodieRecordPayload class. In particular, the 
> Update operation contains a "patch" field, which is extended JSON describing 
> updates for dot-separated field paths.
> We need to pass the Avro schema to the preCombine API for this to work:
> even though the BaseAvroPayload constructor accepts a GenericRecord, which has 
> an Avro schema reference, it materializes the GenericRecord to bytes to 
> support serialization/deserialization by ExternalSpillableMap.
>  
> Is there any concern/objection to this? In other words, have I overlooked 
> something?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205435#comment-17205435
 ] 

Balaji Varadarajan commented on HUDI-1308:
--

cc [~vinoth]

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1311) Writes creating/updating large number of files seeing errors when deleting marker files in S3

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1311:


 Summary: Writes creating/updating large number of files seeing 
errors when deleting marker files in S3
 Key: HUDI-1311
 URL: https://issues.apache.org/jira/browse/HUDI-1311
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


Don't have the exception traces handy. Will add them when I run into this next 
time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1310) Corruption Block Handling too slow in S3

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1310:


 Summary: Corruption Block Handling too slow in S3
 Key: HUDI-1310
 URL: https://issues.apache.org/jira/browse/HUDI-1310
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


The logic to figure out the next valid starting block offset is too slow when 
run against S3. 

I have bolded the log message that takes a long time to appear; an illustrative 
sketch of this scan pattern follows the log output below. 

 

 

36589 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log 
file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Found corrupted block in file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Log 
HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} has a corrupted block at 14
*44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Next available block in* 
HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} starts at 3723319
44566 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
corrupt block in 
s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
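
Illustrative sketch (not Hudi's actual reader code) of the scan pattern whose 
cost blows up on S3: probing forward for the next magic header one small 
positioned read at a time, where each probe can turn into a separate remote 
request unless the bytes are buffered. The magic marker value is assumed for 
illustration.

```
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;

public class CorruptBlockScanSketch {
  private static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.UTF_8); // assumed marker

  /** Returns the offset of the next magic header at or after 'start', or -1 if none. */
  static long findNextBlockStart(FSDataInputStream in, long start, long fileLen) throws IOException {
    byte[] probe = new byte[MAGIC.length];
    for (long pos = start; pos + MAGIC.length <= fileLen; pos++) {
      in.readFully(pos, probe);        // on S3, each tiny positioned read is expensive
      if (Arrays.equals(probe, MAGIC)) {
        return pos;
      }
    }
    return -1;
  }
}
```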



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1308:


Assignee: Balaji Varadarajan  (was: Prashant Wason)

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1308:


Assignee: Prashant Wason

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Prashant Wason
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1309:


Assignee: Prashant Wason

> Listing Metadata unreadable in S3 as the log block is deemed corrupted
> --
>
> Key: HUDI-1309
> URL: https://issues.apache.org/jira/browse/HUDI-1309
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Prashant Wason
>Priority: Major
>
> When running the metadata list-partitions CLI command, I am seeing the 
> messages below and the partition list is empty. I was expecting 10K partitions.
>  
> 36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
> 36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
> 36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
> 44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in 
> HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
> 44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
> 44567 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1309:


 Summary: Listing Metadata unreadable in S3 as the log block is 
deemed corrupted
 Key: HUDI-1309
 URL: https://issues.apache.org/jira/browse/HUDI-1309
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


When running the metadata list-partitions CLI command, I am seeing the 
messages below and the partition list is empty. I was expecting 10K partitions.

 

36589 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log 
file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Found corrupted block in file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Log 
HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} has a corrupted block at 14
44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Next available block in 
HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} starts at 3723319
44566 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
corrupt block in 
s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
44567 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1308:


 Summary: Issues found during testing RFC-15
 Key: HUDI-1308
 URL: https://issues.apache.org/jira/browse/HUDI-1308
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan


This is an umbrella ticket containing all the issues found during testing RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1308:
-
Status: Open  (was: New)

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1257) Insert only write operations should preserve duplicate records

2020-09-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201628#comment-17201628
 ] 

Balaji Varadarajan commented on HUDI-1257:
--

[~nicholasjiang]: Yes, they are the same. You can mark one of those Jiras as a duplicate.

> Insert only write operations should preserve duplicate records
> --
>
> Key: HUDI-1257
> URL: https://issues.apache.org/jira/browse/HUDI-1257
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Nicholas Jiang
>Priority: Major
> Fix For: 0.6.1
>
>
> [https://github.com/apache/hudi/issues/2051]
>  
> ```
>  I think the point [@jiegzhan|https://github.com/jiegzhan] pointed out is 
> reasonable: for the insert operation, we should not update the existing 
> records. Right now the behavior/result is different when setting a different 
> small file limit. When it is set to 0, the new inserts will not update the old 
> records and are written into a new file, but when it is set to another value 
> such as 128M, the new inserts may update the old records that lie in small 
> files picked up by the UpsertPartitioner.
> ```
>  
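
For reference, an illustrative write (a sketch, not a fix) showing the config 
that drives the behaviour described in the quote above; the config keys are 
existing Hudi options, while the table name and base path are placeholders:

```
// Sketch: with operation=insert, hoodie.parquet.small.file.limit decides whether new
// inserts are packed into existing small files (non-zero limit, e.g. 128 MB) or always
// written to new files (limit = 0); per the quote, duplicates survive only in the
// limit = 0 case today.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class InsertPreservingDuplicates {
  static void insertBatch(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "my_table")            // placeholder
        .option("hoodie.datasource.write.operation", "insert")
        .option("hoodie.parquet.small.file.limit", "0")     // 0 = skip small-file packing
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```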



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1290:
-
Status: Open  (was: New)

> Implement Debezium avro source for Delta Streamer
> -
>
> Key: HUDI-1290
> URL: https://issues.apache.org/jira/browse/HUDI-1290
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> We need to implement a transformer and payloads for seamlessly pulling the 
> change logs that Debezium emits into Kafka. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1290:


Assignee: Balaji Varadarajan

> Implement Debezium avro source for Delta Streamer
> -
>
> Key: HUDI-1290
> URL: https://issues.apache.org/jira/browse/HUDI-1290
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> We need to implement a transformer and payloads for seamlessly pulling the 
> change logs that Debezium emits into Kafka. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2020-09-21 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1290:


 Summary: Implement Debezium avro source for Delta Streamer
 Key: HUDI-1290
 URL: https://issues.apache.org/jira/browse/HUDI-1290
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


We need to implement a transformer and payloads for seamlessly pulling the 
change logs that Debezium emits into Kafka. 
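
As a rough illustration of the transformer half of this work (a sketch under 
assumptions, not the ticket's implementation; exact package names can vary 
across Hudi versions), a Transformer that flattens the Debezium envelope by 
keeping op/ts_ms and promoting the "after" image:

```
// Sketch: flatten a Debezium change event so the downstream payload/writer sees the row
// image plus the op/ts_ms metadata. Envelope field names (op, ts_ms, after) come from
// Debezium; the output column names (_op, _ts_ms) are assumptions.
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DebeziumFlattenTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // Keep change metadata so a payload can later honour deletes and ordering,
    // and expand the "after" struct into top-level columns.
    return rowDataset.selectExpr("op as _op", "ts_ms as _ts_ms", "after.*");
  }
}
```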

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1270) NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5

2020-09-13 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1270:
-
Status: Open  (was: New)

> NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5
> ---
>
> Key: HUDI-1270
> URL: https://issues.apache.org/jira/browse/HUDI-1270
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Priority: Major
>
> There are some AWS EMR users reporting:
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.PartitionedFile.
> on EMR (Spark-2.4.5-amzn-0) when using the Spark Datasource to query a MOR 
> table.
> [https://github.com/apache/hudi/pull/1848#issuecomment-687392285]
> [https://github.com/apache/hudi/issues/2057#issuecomment-685015564]
> [~uditme] [~vbalaji] would you guys be able to help?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1270) NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5

2020-09-13 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195158#comment-17195158
 ] 

Balaji Varadarajan commented on HUDI-1270:
--

[~uditme] : Pinging 

> NoSuchMethod PartitionedFile on AWS EMR Spark 2.4.5
> ---
>
> Key: HUDI-1270
> URL: https://issues.apache.org/jira/browse/HUDI-1270
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Priority: Major
>
> There are some AWS EMR users reporting:
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.PartitionedFile.
> on EMR (Spark-2.4.5-amzn-0) when using the Spark Datasource to query a MOR 
> table.
> [https://github.com/apache/hudi/pull/1848#issuecomment-687392285]
> [https://github.com/apache/hudi/issues/2057#issuecomment-685015564]
> [~uditme] [~vbalaji] would you guys be able to help?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1280) Add tool to capture earliest or latest offsets in kafka topics

2020-09-13 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1280:


 Summary: Add tool to capture earliest or latest offsets in kafka 
topics 
 Key: HUDI-1280
 URL: https://issues.apache.org/jira/browse/HUDI-1280
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


For bootstrapping cases using spark.write(), we need to capture offsets from 
the Kafka topic and use them as the checkpoint for subsequent reads from Kafka 
topics.

 

[https://github.com/apache/hudi/issues/1985]

We need to build this integration for a smooth transition to DeltaStreamer.
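
A sketch of what such a tool could look like using plain Kafka consumer APIs. 
The brokers and topic below are placeholders, and the output string follows the 
"topic,partition:offset,..." layout that, to my understanding, Hudi's Kafka 
source uses for checkpoints and that can be passed to DeltaStreamer via its 
--checkpoint option:

```
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaCheckpointTool {
  public static void main(String[] args) {
    String topic = args.length > 0 ? args[0] : "my_topic";            // placeholder topic
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");                 // placeholder brokers
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
          .map(p -> new TopicPartition(p.topic(), p.partition()))
          .collect(Collectors.toList());
      // Swap in consumer.beginningOffsets(partitions) for the "earliest" variant.
      Map<TopicPartition, Long> latest = consumer.endOffsets(partitions);
      String checkpoint = topic + "," + latest.entrySet().stream()
          .map(e -> e.getKey().partition() + ":" + e.getValue())
          .collect(Collectors.joining(","));
      System.out.println(checkpoint);  // capture and pass as the initial checkpoint
    }
  }
}
```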

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   4   5   6   7   8   9   10   >