[GitHub] [hudi] rahulpoptani commented on issue #2180: [SUPPORT] Unable to read MERGE ON READ table with Snapshot option using Databricks.

2020-10-19 Thread GitBox


rahulpoptani commented on issue #2180:
URL: https://github.com/apache/hudi/issues/2180#issuecomment-712585912


   I tried a different environment, with Spark 2.4.5 and Scala 2.12, and I was 
able to successfully perform Insert/Upsert/Delete and Read operations on the 
Merge-On-Read table type.
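   For readers hitting the same problem, a minimal sketch of the kind of snapshot 
read that worked in that environment (Spark 2.4.5 / Scala 2.12) is shown below; the 
base path and option key are assumptions based on the Hudi 0.6.x datasource options, 
not the exact code used here:
   
   ```scala
   // Hedged sketch (not the reporter's exact code): snapshot query on a
   // Merge-On-Read table via the Hudi datasource. The base path is a placeholder.
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("hudi-mor-snapshot").getOrCreate()
   
   val basePath = "s3://my-bucket/my_mor_table"            // hypothetical table location
   val snapshotDf = spark.read
     .format("hudi")
     .option("hoodie.datasource.query.type", "snapshot")   // merge base files with log files
     .load(basePath + "/*/*")                              // glob over partition paths
   
   snapshotDf.show(false)
   ```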



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-19 Thread GitBox


SteNicholas commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-712566489


   > @SteNicholas @leesf : Does this essentially mean we no longer support 
small file handling for "inserts" ?
   > If user doesn't essentially care about duplicates, I agree that we need to 
have same behavior w/o small file handling. Instead of this approach, can we 
create a new type of Write Handle which looks like MergeHandle but does not 
merge but rather appends records and creates a new version of Parquet file. You 
can then use this Handle instead of UpdateHandle when pure insert operation 
type is used.
   > 
   > cc @vinothchandar
   
   Yes, users don't really care about duplicates for small files, and having the 
same behavior without small file handling makes sense.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-19 Thread GitBox


bvaradar commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-712564198


   @SteNicholas @leesf : Does this essentially mean we no longer support small 
file handling for "inserts" ? 
   If user doesn't essentially care about duplicates, I agree that we need to 
have same behavior w/o small file handling. Instead of this approach, can we 
create a new type of Write Handle which looks like MergeHandle but does not 
merge but rather appends records and creates a new version of Parquet file. You 
can then use this Handle instead of UpdateHandle when pure insert operation 
type is used.
   
   cc @vinothchandar 
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on pull request #2127: [HUDI-284] add more test for UpdateSchemaEvolution

2020-10-19 Thread GitBox


lw309637554 commented on pull request #2127:
URL: https://github.com/apache/hudi/pull/2127#issuecomment-712536085


   > @lw309637554 looks like comments from @pratyakshsharma were addressed. 
sorry about the delay. merging now. Thank you @lw309637554 for adding the cases!
   
   Thanks 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-10-19 Thread GitBox


vinothchandar commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-712505539


   >we want to make Hudi compile with spark 2 and then run with spark3?
   
   This was the intention, but as @bschell pointed out, some classes have 
changed and we need to make parts of `hudi-spark` modular and plug in 
Spark-version-specific implementations. 
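   As a purely hypothetical illustration of that plug-in seam (none of these names 
are real Hudi classes), the version-specific pieces could hide behind a small 
adapter loaded at runtime:
   
   ```scala
   // Hypothetical sketch only: SparkVersionAdapter and the implementation class
   // names are invented for illustration and do not exist in Hudi. The idea is that
   // hudi-spark codes against the trait while Spark 2 / Spark 3 modules provide
   // implementations of the APIs that changed between versions.
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.types.StructType
   
   trait SparkVersionAdapter {
     // e.g. Row -> InternalRow conversion, whose API changed between Spark 2 and 3
     def createRowSerializer(schema: StructType): Row => InternalRow
   }
   
   object SparkAdapterLoader {
     def load(): SparkVersionAdapter = {
       val implClass =
         if (org.apache.spark.SPARK_VERSION.startsWith("3.")) "hypothetical.Spark3Adapter"
         else "hypothetical.Spark2Adapter"
       Class.forName(implClass).getDeclaredConstructor().newInstance()
         .asInstanceOf[SparkVersionAdapter]
     }
   }
   ```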



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-303) Avro schema case sensitivity testing

2020-10-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217177#comment-17217177
 ] 

Vinoth Chandar commented on HUDI-303:
-

[~309637554] this task is about exploring all possibilities and making a call.  
IIUC you are making the case for retaining the lower casing. I think what you 
point out is why we lower cased this. 

I can't decide for myself until we paint the full picture. :) 

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: liwei
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1321) Support properties for metadata table via a properties.file

2020-10-19 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1321:
-
Status: Open  (was: New)

> Support properties for metadata table via a properties.file
> ---
>
> Key: HUDI-1321
> URL: https://issues.apache.org/jira/browse/HUDI-1321
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Priority: Major
>
> Metadata properties should be in their own namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1321) Support properties for metadata table via a properties.file

2020-10-19 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1321:
-
Status: In Progress  (was: Open)

> Support properties for metadata table via a properties.file
> ---
>
> Key: HUDI-1321
> URL: https://issues.apache.org/jira/browse/HUDI-1321
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Priority: Major
>
> Metadata properties should be in their own namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] umehrot2 commented on issue #2057: [SUPPORT] AWSDmsAvroPayload not processing Deletes correctly + IOException when reading log file

2020-10-19 Thread GitBox


umehrot2 commented on issue #2057:
URL: https://github.com/apache/hudi/issues/2057#issuecomment-712454567


   > @umehrot2 It looks like for 0.6.0 where this issue is fixed, @WTa-hash is 
seeing the exception `java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.` originating from 
`at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:142)
 `. Any ideas if this is wrong spark versions ?
   
   @n3nash this issue is because of the EMR environment. The jar @WTa-hash was 
building for 0.6.0 is not compiled against EMR's Spark version; EMR has its own 
Spark with various modifications. This issue should not be there in emr-5.31.0, 
where Hudi 0.6.0 is officially supported, so there is no need to replace the 
jars there. The jars we provide there are compiled against our own Spark, which 
fixes this issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #2057: [SUPPORT] AWSDmsAvroPayload not processing Deletes correctly + IOException when reading log file

2020-10-19 Thread GitBox


umehrot2 commented on issue #2057:
URL: https://github.com/apache/hudi/issues/2057#issuecomment-712453460


   > > @umehrot2 Could the IOException be due to #2089 ?
   > 
   > I'm not entirely sure if it's related to this issue as the steps to 
reproduce is different, but the thing I see in common is that both issues are 
referencing a MOR table. I don't get this issue when my table is COW.
   
   @n3nash @WTa-hash Apologies for the late response here. I think we need the 
full stack trace to be able to debug this. Looking at 
https://github.com/apache/hudi/blob/release-0.6.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordScanner.java#L244
 it seems that the full exception is only logged at the executors. So, either 
check the executor logs to see the full trace, or change that line to throw the 
actual exception back to the driver. If this is still happening, you can also 
open a JIRA with some basic reproduction steps if possible. I do not personally 
think that it would be related to #2089, but we cannot be certain without 
seeing the full exception.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on pull request #2189: Some more updates to the rfc-15 implementation

2020-10-19 Thread GitBox


prashantwason commented on pull request #2189:
URL: https://github.com/apache/hudi/pull/2189#issuecomment-712449097


   @vinothchandar  Some more updates from my side. PTAL.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason opened a new pull request #2189: Some more updates to the rfc-15 implementation

2020-10-19 Thread GitBox


prashantwason opened a new pull request #2189:
URL: https://github.com/apache/hudi/pull/2189


   ## Brief change log
   
   Please see individual commits and the tagged JIRA items for details.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 merged pull request #2185: [HUDI-1345] Remove Hbase and htrace relocation from utilities bundle

2020-10-19 Thread GitBox


umehrot2 merged pull request #2185:
URL: https://github.com/apache/hudi/pull/2185


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1345] Remove Hbase and htrace relocation from utilities bundle (#2185)

2020-10-19 Thread uditme
This is an automated email from the ASF dual-hosted git repository.

uditme pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6490b02  [HUDI-1345] Remove Hbase and htrace relocation from utilities 
bundle (#2185)
6490b02 is described below

commit 6490b029dd05e5a3d704ebd5b314899a53dd76fb
Author: Bhavani Sudha Saktheeswaran 
AuthorDate: Mon Oct 19 14:11:08 2020 -0700

[HUDI-1345] Remove Hbase and htrace relocation from utilities bundle (#2185)
---
 packaging/hudi-utilities-bundle/pom.xml | 8 
 1 file changed, 8 deletions(-)

diff --git a/packaging/hudi-utilities-bundle/pom.xml 
b/packaging/hudi-utilities-bundle/pom.xml
index 91ae5fd..39e48bb 100644
--- a/packaging/hudi-utilities-bundle/pom.xml
+++ b/packaging/hudi-utilities-bundle/pom.xml
@@ -168,14 +168,6 @@
                   <shadedPattern>org.apache.hudi.org.apache.commons.codec.</shadedPattern>
                 </relocation>
                 <relocation>
-                  <pattern>org.apache.hadoop.hbase.</pattern>
-                  <shadedPattern>org.apache.hudi.org.apache.hadoop.hbase.</shadedPattern>
-                </relocation>
-                <relocation>
-                  <pattern>org.apache.htrace.</pattern>
-                  <shadedPattern>org.apache.hudi.org.apache.htrace.</shadedPattern>
-                </relocation>
-                <relocation>
                   <pattern>org.eclipse.jetty.</pattern>
                   <shadedPattern>org.apache.hudi.org.eclipse.jetty.</shadedPattern>
                 </relocation>



[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-19 Thread GitBox


ashishmgofficial edited a comment on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-712404711


   Not sure if this is going to be of any help, but attaching the latest logs. 
I can see these messages towards the end:
   ```
   at scala.collection.Iterator$class.isEmpty(Iterator.scala:331)
   at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1334)
   at 
org.apache.hudi.AvroConversionUtils$$anonfun$2.apply(AvroConversionUtils.scala:46)
   at 
org.apache.hudi.AvroConversionUtils$$anonfun$2.apply(AvroConversionUtils.scala:45)
   
   ```
   [o.log](https://github.com/apache/hudi/files/5404352/o.log)
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-19 Thread GitBox


ashishmgofficial edited a comment on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-712404711


   Not sure if this is going to be of any help, but attaching the latest logs. 
I can see these messages towards the end:
   ```
   at scala.collection.Iterator$class.isEmpty(Iterator.scala:331)
   at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1334)
   **at 
org.apache.hudi.AvroConversionUtils$$anonfun$2.apply(AvroConversionUtils.scala:46)
   at 
org.apache.hudi.AvroConversionUtils$$anonfun$2.apply(AvroConversionUtils.scala:45)**
   
   ```
   [o.log](https://github.com/apache/hudi/files/5404352/o.log)
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-19 Thread GitBox


ashishmgofficial commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-712404711


   Not sure if this is going to be of any help, but attaching the latest logs. 
I can see these messages towards the end:
   ```
   at scala.collection.Iterator$class.isEmpty(Iterator.scala:331)
   at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1334)
   **at 
org.apache.hudi.AvroConversionUtils$$anonfun$2.apply(AvroConversionUtils.scala:46)
   at 
org.apache.hudi.AvroConversionUtils$$anonfun$2.apply(AvroConversionUtils.scala:45)**
   
   ```
   [Uploading o.log…]()
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-10-19 Thread GitBox


zhedoubushishi commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-712391147


   @bschell @vinothchandar to be clear, just wondering what the exact goal of 
this PR is. Do we want Hudi to both compile and run with Spark 3, or do we want 
Hudi to compile with Spark 2 and then run with Spark 3?
   
   Ideally we should make Hudi both compile and run with Spark 3, but the 
current code change cannot compile with Spark 3.
   
   Run
   
   ```
   mvn clean install -DskipTests -DskipITs -Dspark.version=3.0.0 -Pscala-2.12
   ```
   
   returns
   
   ```
   [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile 
(default-testCompile) on project hudi-client: Compilation failure
   [ERROR] 
/Users/wenningd/workplace/Aws157Hudi/src/Aws157Hudi/hudi-client/src/test/java/org/apache/hudi/testutils/SparkDatasetTestUtils.java:[146,27]
 cannot find symbol
   [ERROR]   symbol:   method toRow(org.apache.spark.sql.Row)
   [ERROR]   location: variable encoder of type 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
   [ERROR] 
   
   ```
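   For context, the missing symbol is `ExpressionEncoder.toRow`, which exists in 
Spark 2.4 but was removed in Spark 3.0. Below is a hedged sketch of the two call 
shapes (placeholder row and schema; the Spark 3 replacement shown is an assumption 
based on `createSerializer`):
   
   ```scala
   // Sketch of the API change behind the compile error (placeholder schema/row).
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
   import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
   
   val schema  = StructType(Seq(StructField("id", IntegerType)))
   val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
   val row     = Row(1)
   
   // Spark 2.4.x: direct conversion to an InternalRow.
   // val internalRow = encoder.toRow(row)
   
   // Spark 3.0.x: toRow is gone; a serializer function is created instead.
   // val internalRow = encoder.createSerializer().apply(row)
   ```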



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-19 Thread GitBox


ashishmgofficial commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-712377184


   @bvaradar  I can provide all the SQL statements in Postgres which I'm using 
to reproduce this, though: 
   
   ```
   DROP TABLE public.motor_crash_violation_incidents;
   
   CREATE TABLE public.motor_crash_violation_incidents (
inc_id serial ,
"year" int4 NULL,
violation_desc varchar(100) NULL,
violation_code varchar(20) NULL,
case_individual_id int4 NULL,
flag varchar(1) NULL,
last_modified_ts timestamp not NULL,
CONSTRAINT motor_crash_violation_incidents_pkey PRIMARY KEY (inc_id)
   );
   
   ALTER TABLE public.motor_crash_violation_incidents REPLICA IDENTITY FULL;
   ```
   Insert records : 
   
   ```
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(1, 2016, 'DRIVING WHILE INTOXICATED', '11923', 17475366, 'I', 
'2020-09-24 11:03:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(3, 2016, 'AGGRAVATED UNLIC OPER 2ND/PREV CONV', '5112A1', 17475367, 
'U', '2020-09-24 15:00:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(4, 2019, 'AGGRAVATED UNLIC OPER 2ND/PREV', '5112A2', 17475368, 'I', 
'2019-09-24 15:00:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(2, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '2180F', 17475569, 
'U', '2020-09-29 11:00:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(9, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475573, 
'I', '2020-09-29 11:00:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(10, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180D', 17475574, 
'I', '2020-09-29 11:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(11, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180D', 17475574, 
'I', '2020-09-29 12:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(12, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475574, 
'I', '2020-09-29 13:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(13, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475574, 
'I', '2020-09-29 14:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(34, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475574, 
'I', '2020-09-29 15:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(35, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475574, 
'I', '2020-09-29 16:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(36, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475574, 
'I', '2020-09-29 17:00:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(37, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180E', 17475574, 
'I', '2020-09-29 17:10:00.000');
   commit;
   INSERT INTO public.motor_crash_violation_incidents
   (inc_id, "year", violation_desc, violation_code, case_individual_id, flag, 
last_modified_ts)
   VALUES(38, 2020, 'UNREASONABLE SPEED/SPECIAL HAZARDS', '1180D', 17475574, 
'I', '2020-09-29 18:00:00.000');
   commit;
   ```
   
   Issue Delete : 
   
   ```
   DELETE FROM public.motor_crash_violation_incidents
   WHERE inc_id=3;
   ```
   
   
   
   These changes are automatically picked up by Confluent Kafka's Postgres 
Debezium Connector and written to the topic.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

[GitHub] [hudi] bvaradar closed issue #2108: [SUPPORT]Submit rollback -->Pending job --> kill YARN --> lost data

2020-10-19 Thread GitBox


bvaradar closed issue #2108:
URL: https://github.com/apache/hudi/issues/2108


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1340) Not able to query real time table when rows contains nested elements

2020-10-19 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216913#comment-17216913
 ] 

Balaji Varadarajan commented on HUDI-1340:
--

[~bdighe]: Did you use --conf spark.sql.hive.convertMetastoreParquet=false when 
you started your spark-shell where you are running the query ?

 

https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whydowehavetoset2differentwaysofconfiguringSparktoworkwithHudi?
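A minimal sketch of setting that flag before querying the _rt table (assumed session 
setup, not the reporter's exact environment):

{code}
// Sketch: disable Hive's built-in Parquet conversion so the table's
// HoodieParquetRealtimeInputFormat is actually used for the _rt query.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-rt-query")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("select name, experience from users_mor_rt").show(false)
{code}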

> Not able to query real time table when rows contains nested elements
> 
>
> Key: HUDI-1340
> URL: https://issues.apache.org/jira/browse/HUDI-1340
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Bharat Dighe
>Priority: Major
> Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, 
> users3.avro, users4.avro, users5.avro
>
>
> AVRO schema: Attached
> Script to generate sample data: attached
> Sample data attached
> ==
> the schema as nested elements, here is the output from hive
> {code:java}
>   CREATE EXTERNAL TABLE `users_mor_rt`( 
>  `_hoodie_commit_time` string, 
>  `_hoodie_commit_seqno` string, 
>  `_hoodie_record_key` string, 
>  `_hoodie_partition_path` string, 
>  `_hoodie_file_name` string, 
>  `name` string, 
>  `userid` int, 
>  `datehired` string, 
>  `meta` struct, 
>  `experience` 
> struct>>) 
>  PARTITIONED BY ( 
>  `role` string) 
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  LOCATION 
>  'hdfs://namenode:8020/tmp/hudi_repair_order_mor' 
>  TBLPROPERTIES ( 
>  'last_commit_time_sync'='20201011190954', 
>  'transient_lastDdlTime'='1602442906')
> {code}
> scala  code:
> {code:java}
> import java.io.File
> import org.apache.hudi.QuickstartUtils._
> import org.apache.spark.sql.SaveMode._
> import org.apache.avro.Schema
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "users_mor"
> //  val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> //  Insert Data
> /// local not hdfs !!!
> //val schema = new Schema.Parser().parse(new 
> File("/var/hoodie/ws/docker/demo/data/user/user.avsc"))
> def updateHudi( num:String, op:String) = {
> val path = "hdfs:///var/demo/data/user/users" + num + ".avro"
> println( path );
> val avdf2 =  new org.apache.spark.sql.SQLContext(sc).read.format("avro").
> // option("avroSchema", schema.toString).
> load(path)
> avdf2.select("name").show(false)
> avdf2.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(OPERATION_OPT_KEY,op).
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // 
> default:COPY_ON_WRITE, MERGE_ON_READ
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.ComplexKeyGenerator").
> option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime").   // dedup
> option(RECORDKEY_FIELD_OPT_KEY, "userId").   // key
> option(PARTITIONPATH_FIELD_OPT_KEY, "role").
> option(TABLE_NAME, tableName).
> option("hoodie.compact.inline", false).
> option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
> option(HIVE_TABLE_OPT_KEY, tableName).
> option(HIVE_USER_OPT_KEY, "hive").
> option(HIVE_PASS_OPT_KEY, "hive").
> option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1").
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "role").
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").
> option("hoodie.datasource.hive_sync.assume_date_partitioning", 
> "false").
> mode(Append).
> save(basePath)
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, experience.companies[0] from " + tableName + 
> "_rt").show()
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + 
> "_ro").show()
> }
> updateHudi("1", "bulkinsert")
> updateHudi("2", "upsert")
> updateHudi("3", "upsert")
> updateHudi("4", "upsert")
> {code}
> If nested fields are not included, it works fine
> {code}
> scala> spark.sql("select name from users_mor_rt");
> res19: org.apache.spark.sql.DataFrame = [name: string]
> scala> spark.sql("select name from users_mor_rt").show();
> +-+
> | name|
> +-+
> |engg3|
> |engg1_new|
> |engg2_new|
> | mgr1|
> | mgr2|
> |  

[GitHub] [hudi] bvaradar commented on issue #2108: [SUPPORT]Submit rollback -->Pending job --> kill YARN --> lost data

2020-10-19 Thread GitBox


bvaradar commented on issue #2108:
URL: https://github.com/apache/hudi/issues/2108#issuecomment-712304151


   Closing this due to inactivity.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1340) Not able to query real time table when rows contains nested elements

2020-10-19 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1340:
-
Status: Open  (was: New)

> Not able to query real time table when rows contains nested elements
> 
>
> Key: HUDI-1340
> URL: https://issues.apache.org/jira/browse/HUDI-1340
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Bharat Dighe
>Priority: Major
> Attachments: create_avro.py, user.avsc, users1.avro, users2.avro, 
> users3.avro, users4.avro, users5.avro
>
>
> AVRO schema: Attached
> Script to generate sample data: attached
> Sample data attached
> ==
> the schema as nested elements, here is the output from hive
> {code:java}
>   CREATE EXTERNAL TABLE `users_mor_rt`( 
>  `_hoodie_commit_time` string, 
>  `_hoodie_commit_seqno` string, 
>  `_hoodie_record_key` string, 
>  `_hoodie_partition_path` string, 
>  `_hoodie_file_name` string, 
>  `name` string, 
>  `userid` int, 
>  `datehired` string, 
>  `meta` struct, 
>  `experience` 
> struct>>) 
>  PARTITIONED BY ( 
>  `role` string) 
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  LOCATION 
>  'hdfs://namenode:8020/tmp/hudi_repair_order_mor' 
>  TBLPROPERTIES ( 
>  'last_commit_time_sync'='20201011190954', 
>  'transient_lastDdlTime'='1602442906')
> {code}
> scala  code:
> {code:java}
> import java.io.File
> import org.apache.hudi.QuickstartUtils._
> import org.apache.spark.sql.SaveMode._
> import org.apache.avro.Schema
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "users_mor"
> //  val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> val basePath = "hdfs:///tmp/hudi_repair_order_mor"
> //  Insert Data
> /// local not hdfs !!!
> //val schema = new Schema.Parser().parse(new 
> File("/var/hoodie/ws/docker/demo/data/user/user.avsc"))
> def updateHudi( num:String, op:String) = {
> val path = "hdfs:///var/demo/data/user/users" + num + ".avro"
> println( path );
> val avdf2 =  new org.apache.spark.sql.SQLContext(sc).read.format("avro").
> // option("avroSchema", schema.toString).
> load(path)
> avdf2.select("name").show(false)
> avdf2.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(OPERATION_OPT_KEY,op).
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ"). // 
> default:COPY_ON_WRITE, MERGE_ON_READ
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.ComplexKeyGenerator").
> option(PRECOMBINE_FIELD_OPT_KEY, "meta.ingestTime").   // dedup
> option(RECORDKEY_FIELD_OPT_KEY, "userId").   // key
> option(PARTITIONPATH_FIELD_OPT_KEY, "role").
> option(TABLE_NAME, tableName).
> option("hoodie.compact.inline", false).
> option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
> option(HIVE_TABLE_OPT_KEY, tableName).
> option(HIVE_USER_OPT_KEY, "hive").
> option(HIVE_PASS_OPT_KEY, "hive").
> option(HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:1").
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "role").
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").
> option("hoodie.datasource.hive_sync.assume_date_partitioning", 
> "false").
> mode(Append).
> save(basePath)
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, experience.companies[0] from " + tableName + 
> "_rt").show()
> spark.sql("select name, _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, _hoodie_commit_seqno from " + tableName + 
> "_ro").show()
> }
> updateHudi("1", "bulkinsert")
> updateHudi("2", "upsert")
> updateHudi("3", "upsert")
> updateHudi("4", "upsert")
> {code}
> If nested fields are not included, it works fine
> {code}
> scala> spark.sql("select name from users_mor_rt");
> res19: org.apache.spark.sql.DataFrame = [name: string]
> scala> spark.sql("select name from users_mor_rt").show();
> +---------+
> |     name|
> +---------+
> |    engg3|
> |engg1_new|
> |engg2_new|
> |     mgr1|
> |     mgr2|
> |  devops1|
> |  devops2|
> +---------+
> {code}
> But fails when I include nested field 'experience'
> {code}
> scala> spark.sql("select name, experience from users_mor_rt").show();
> 20/10/11 19:53:58 ERROR executor.Executor: Exception in task 0.0 in stage 
> 147.0 (TID 153)
> 

[GitHub] [hudi] bvaradar commented on issue #2162: [SUPPORT] Deltastreamer transform cannot add fields

2020-10-19 Thread GitBox


bvaradar commented on issue #2162:
URL: https://github.com/apache/hudi/issues/2162#issuecomment-712289095


   @liujinhui1994 : Adding 
   
   > Can work, but if the default value is not null, it will not work
   > {
   > "name": "adnetDesc",
   > "type": ["null", "long"],
   > "default": -1
   > }
   > @bvaradar
   
   Let's discuss this in PR once you open it. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan merged pull request #2127: [HUDI-284] add more test for UpdateSchemaEvolution

2020-10-19 Thread GitBox


xushiyan merged pull request #2127:
URL: https://github.com/apache/hudi/pull/2127


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-284] add more test for UpdateSchemaEvolution (#2127)

2020-10-19 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4d80e1e  [HUDI-284] add more test for UpdateSchemaEvolution (#2127)
4d80e1e is described below

commit 4d80e1e221b0ddfd542baeee9b6cbb3b28a88e68
Author: lw0090 
AuthorDate: Mon Oct 19 22:38:04 2020 +0800

[HUDI-284] add more test for UpdateSchemaEvolution (#2127)

Unit test different schema evolution scenarios.
---
 .../java/org/apache/hudi/io/HoodieMergeHandle.java |  11 +-
 ...EvolvedSchema.txt => exampleEvolvedSchema.avsc} |   0
 ...ma.txt => exampleEvolvedSchemaChangeOrder.avsc} |   8 +-
 txt => exampleEvolvedSchemaColumnRequire.avsc} |   2 +-
 ...ema.txt => exampleEvolvedSchemaColumnType.avsc} |   8 +-
 ...a.txt => exampleEvolvedSchemaDeleteColumn.avsc} |   8 +-
 .../{exampleSchema.txt => exampleSchema.avsc}  |   0
 .../hudi/client/TestUpdateSchemaEvolution.java | 194 +++--
 .../org/apache/hudi/index/TestHoodieIndex.java |   2 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   2 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   2 +-
 .../commit/TestCopyOnWriteActionExecutor.java  |   2 +-
 .../table/action/commit/TestUpsertPartitioner.java |   2 +-
 13 files changed, 159 insertions(+), 82 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index 77fef5c..faa7ff6 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -197,7 +197,6 @@ public class HoodieMergeHandle extends H
   } else {
 recordsDeleted++;
   }
-
   writeStatus.markSuccess(hoodieRecord, recordMetadata);
   // deflate record payload after recording success. This will help users 
access payload as a
   // part of marking
@@ -243,16 +242,14 @@ public class HoodieMergeHandle extends H
 if (copyOldRecord) {
   // this should work as it is, since this is an existing record
   String errMsg = "Failed to merge old record into new file for key " + 
key + " from old file " + getOldFilePath()
-  + " to new file " + newFilePath;
+  + " to new file " + newFilePath + " with writerSchema " + 
writerSchemaWithMetafields.toString(true);
   try {
 fileWriter.writeAvro(key, oldRecord);
   } catch (ClassCastException e) {
-LOG.error("Schema mismatch when rewriting old record " + oldRecord + " 
from file " + getOldFilePath()
-+ " to file " + newFilePath + " with writerSchema " + 
writerSchemaWithMetafields.toString(true));
+LOG.debug("Old record is " + oldRecord);
 throw new HoodieUpsertException(errMsg, e);
-  } catch (IOException e) {
-LOG.error("Failed to merge old record into new file for key " + key + 
" from old file " + getOldFilePath()
-+ " to new file " + newFilePath, e);
+  } catch (IOException | RuntimeException e) {
+LOG.debug("Old record is " + oldRecord);
 throw new HoodieUpsertException(errMsg, e);
   }
   recordsWritten++;
diff --git 
a/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.txt 
b/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.avsc
similarity index 100%
copy from 
hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.txt
copy to 
hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.avsc
diff --git 
a/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.txt 
b/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchemaChangeOrder.avsc
similarity index 94%
copy from 
hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.txt
copy to 
hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchemaChangeOrder.avsc
index c85c3ce..16844ff 100644
--- a/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.txt
+++ 
b/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchemaChangeOrder.avsc
@@ -21,10 +21,6 @@
 "name": "trip",
 "fields": [
 {
-"name": "number",
-"type": ["int", "null"]
-},
-{
 "name": "time",
 "type": "string"
 },
@@ -35,6 +31,10 @@
 {
 "name": "added_field",
 "type": ["int", "null"]
+},
+{
+ "name": "number",
+ "type": ["int", "null"]
 }
 ]
 }
diff --git 
a/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchema.txt 
b/hudi-client/hudi-client-common/src/test/resources/exampleEvolvedSchemaColumnRequire.avsc
similarity index 

[GitHub] [hudi] xushiyan commented on pull request #2127: [HUDI-284] add more test for UpdateSchemaEvolution

2020-10-19 Thread GitBox


xushiyan commented on pull request #2127:
URL: https://github.com/apache/hudi/pull/2127#issuecomment-712208378


   @lw309637554 looks like comments from @pratyakshsharma were addressed. sorry 
about the delay. merging now. Thank you @lw309637554 for adding the cases!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-303) Avro schema case sensitivity testing

2020-10-19 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216732#comment-17216732
 ] 

liwei commented on HUDI-303:


[~uditme], [~vinoth] what do you think about this? :D

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: liwei
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-303) Avro schema case sensitivity testing

2020-10-19 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reopened HUDI-303:


> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: liwei
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-303) Avro schema case sensitivity testing

2020-10-19 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216728#comment-17216728
 ] 

liwei commented on HUDI-303:


I do not think this should be fixed, because Hive metastore columns are case 
insensitive: if we do not lowercase, the Hive meta schema will not match the 
Avro schema. For example, the column names from 
hive_metastoreConstants.META_TABLE_COLUMNS will be case insensitive:

// Map Avro field names to fields, then build the Hive-ordered schema.
Map<String, Schema.Field> schemaFieldsMap = 
HoodieRealtimeRecordReaderUtils.getNameToFieldMap(writerSchema);
hiveSchema = constructHiveOrderedSchema(writerSchema, schemaFieldsMap);

// Get all column names of the Hive table (reported in lower case).
String hiveColumnString = 
jobConf.get(hive_metastoreConstants.META_TABLE_COLUMNS);
LOG.info("Hive Columns : " + hiveColumnString);
String[] hiveColumns = hiveColumnString.split(",");
List<Schema.Field> hiveSchemaFields = new ArrayList<>();

for (String columnName : hiveColumns) {
  // Look up the Avro field by the lower-cased Hive column name.
  Schema.Field field = schemaFieldsMap.get(columnName.toLowerCase());

  if (field != null) {
    hiveSchemaFields.add(new Schema.Field(field.name(), field.schema(), 
        field.doc(), field.defaultVal()));
  } else {
    // Hive has some extra virtual columns like BLOCK__OFFSET__INSIDE__FILE which 
    // do not exist in the table schema.
    // They will get skipped as they won't be found in the original schema.
    LOG.debug("Skipping Hive Column => " + columnName);
  }
}
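
A tiny self-contained illustration of the mismatch described above (hypothetical 
field names, plain Scala, not Hudi code): Hive reports its column names in lower 
case, so lookups against Avro field names only succeed after lower-casing.

{code}
// Hypothetical illustration: Hive hands back lower-cased column names, so a
// case-sensitive lookup against mixed-case Avro field names finds nothing.
val avroFields  = Map("userId" -> "int", "dateHired" -> "string")
val hiveColumns = Seq("userid", "datehired")   // as reported by the Hive metastore

val caseSensitiveHits = hiveColumns.flatMap(avroFields.get)
// List()  -- no matches

val lowerCasedFields  = avroFields.map { case (name, t) => name.toLowerCase -> t }
val lowerCasedHits    = hiveColumns.flatMap(lowerCasedFields.get)
// List(int, string)

println(s"case-sensitive: $caseSensitiveHits, lower-cased: $lowerCasedHits")
{code}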

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-303) Avro schema case sensitivity testing

2020-10-19 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-303:
--

Assignee: liwei  (was: Udit Mehrotra)

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: liwei
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-303) Avro schema case sensitivity testing

2020-10-19 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-303.

Resolution: Fixed

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: liwei
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] lw309637554 commented on pull request #2127: [HUDI-284] add more test for UpdateSchemaEvolution

2020-10-19 Thread GitBox


lw309637554 commented on pull request #2127:
URL: https://github.com/apache/hudi/pull/2127#issuecomment-712104176


   @pratyakshsharma @xushiyan @vinothchandar  hello, 
   Is there anything that needs to be fixed? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on issue #2162: [SUPPORT] Deltastreamer transform cannot add fields

2020-10-19 Thread GitBox


liujinhui1994 commented on issue #2162:
URL: https://github.com/apache/hudi/issues/2162#issuecomment-711759479


   I am late in replying, sorry. I have verified this in the production 
environment, and I am currently writing unit tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on issue #2162: [SUPPORT] Deltastreamer transform cannot add fields

2020-10-19 Thread GitBox


liujinhui1994 commented on issue #2162:
URL: https://github.com/apache/hudi/issues/2162#issuecomment-711757869


   Can work, but if the default value is not null, it will not work
   {
   "name": "adnetDesc",
   "type": ["null", "long"],
   "default": -1
   }
@bvaradar   
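   
   For background on why the non-null default fails: the Avro spec requires a 
union field's default to match the first branch of the union, so ["null", "long"] 
cannot default to -1. A hedged sketch of an ordering that does accept the -1 
default (field name taken from the comment; the record wrapper is invented for 
illustration):
   
   ```scala
   // Sketch: with Avro, a union's default value must conform to the *first* branch
   // of the union, so "long" has to come first for a default of -1 to be valid.
   import org.apache.avro.Schema
   
   val schemaJson =
     """{
       |  "type": "record",
       |  "name": "Example",
       |  "fields": [
       |    {"name": "adnetDesc", "type": ["long", "null"], "default": -1}
       |  ]
       |}""".stripMargin
   
   val schema = new Schema.Parser().parse(schemaJson)
   println(schema.getField("adnetDesc").defaultVal())   // prints -1
   ```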



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-19 Thread GitBox


KarthickAN commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711645166


   @nsivabalan @vinothchandar Thank you so much for all the explanations. If I 
think about it, having 10MB worth of index data may not be an issue as long as 
the file contains a considerable number of records. In my case there was a 
scenario where I had only 1000 records but 10MB of index. So I switched to the 
dynamic bloom filter now, which is really helpful in this case. 
   
   We are dealing with two different types of data, one of which doesn't have 
much volume. That's where it threw things off, whereas for the other type, where 
we do have a good volume of data, this didn't come up as an issue since we'd 
already have around 110-120MB worth of data plus index. As of now I've 
configured it like below:
   
   IndexBloomNumEntries = 35000
   BloomIndexFilterType = DYNAMIC_V0
   BloomIndexFilterDynamicMaxEntries = 140
   
   starting off with 35k (1% of the max number of entries in a file) as a base 
and scaling it out to 1.4M (40% of the max number of entries in a file) entries 
as the file grows. So that should possibly solve the problem. Anyway, we need to 
test this out for the volume we are seeing right now and tune it further if required.
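   
   For reference, a hedged sketch of how those settings map onto Hudi write 
options (config keys assumed from the 0.6.x index configuration; the 35k base and 
1.4M cap come from the comment above; table name, fields, and path are placeholders):
   
   ```scala
   // Sketch: dynamic bloom filter sizing on a Hudi write (assumed 0.6.x config keys,
   // placeholder data, table name, and base path).
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   val spark = SparkSession.builder().appName("bloom-config-sketch").getOrCreate()
   import spark.implicits._
   
   val df = Seq((1, "payload-a", "2020-10-19")).toDF("id", "payload", "ts")
   
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.index.bloom.num_entries", "35000")                   // base: ~1% of max entries per file
     .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")              // auto-scaling bloom filter
     .option("hoodie.bloom.index.filter.dynamic.max.entries", "1400000")  // cap: the 1.4M mentioned above
     .mode(SaveMode.Append)
     .save("/tmp/bloom_config_sketch")
   ```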
   
   @vinothchandar Yes, having a blog around this will definitely be very 
helpful. I feel Hudi has a lot of features that could be used more effectively 
with some more in-depth explanations than what we have right now as part of the 
documentation. 
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org