[GitHub] [incubator-hudi] cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself
cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-573493381 Closing this PR now. Some of the problems have been fixed in PR https://github.com/apache/incubator-hudi/pull/1005. The remaining timestamp type problem will be discussed further in other JIRA issues. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532921049 I will continue to discuss this issue on JIRA later. The version I'm running in production now is Hudi 0.4.8 with this PR applied. If there are new changes, I can also run some experiments in my test environment.
cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532919914 @vinothchandar @umehrot2 I've tried changing the avro version to 1.8.2 in the hudi pom.xml before. Spark 2.2 and 2.3 don't pick up the avro 1.8.2 jars bundled in the hoodie jar; they load avro 1.7.7 first, and I still hit the same errors (missing logical type class and so on). You could try testing avro 1.8.2 with the following options in the shell:
```
--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true
```
Or relocate avro in the spark-bundle, the same way the hive dependencies are handled:
```
<relocation>
  <pattern>org.apache.avro</pattern>
  <shadedPattern>com.apache.hudi.org.apache.avro</shadedPattern>
</relocation>
```
cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532471718 @vinothchandar The avro 1.7.7 jar under spark can be directly replaced with 1.8.2. I have tested some of the code and confirmed that directly replacing the jar is a feasible approach. In most cases, the avro 1.8.2 API is compatible with avro 1.7.7.
cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532470332 @umehrot2 In addition to the decimal problem, I also fixed a timestamp conversion problem. On a Spark dataset, this PR gets the right result. But there are still some problems with Hive and Spark SQL: Hive 2.3 does not correctly identify the logical type in a parquet-avro file, so a timestamp may be cast to a long in Hive 2.3. I modified some of Hive's source in `org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector` to solve this problem:
```
package org.apache.hadoop.hive.serde2.objectinspector.primitive;

import java.sql.Timestamp;

import org.apache.hadoop.hive.serde2.io.TimestampWritable;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.io.LongWritable;

public class WritableTimestampObjectInspector extends AbstractPrimitiveWritableObjectInspector
    implements SettableTimestampObjectInspector {

  public WritableTimestampObjectInspector() {
    super(TypeInfoFactory.timestampTypeInfo);
  }

  @Override
  public TimestampWritable getPrimitiveWritableObject(Object o) {
    // Hive 2.3 may hand us a raw LongWritable when the logical type is ignored
    if (o instanceof LongWritable) {
      return (TimestampWritable) PrimitiveObjectInspectorFactory.writableTimestampObjectInspector
          .create(new Timestamp(((LongWritable) o).get()));
    }
    return o == null ? null : (TimestampWritable) o;
  }

  public Timestamp getPrimitiveJavaObject(Object o) {
    if (o instanceof LongWritable) {
      return new Timestamp(((LongWritable) o).get());
    }
    return o == null ? null : ((TimestampWritable) o).getTimestamp();
  }

  public Object copyObject(Object o) {
    if (o instanceof LongWritable) {
      return new TimestampWritable(new Timestamp(((LongWritable) o).get()));
    }
    return o == null ? null : new TimestampWritable((TimestampWritable) o);
  }

  public Object set(Object o, byte[] bytes, int offset) {
    if (o instanceof LongWritable) {
      o = PrimitiveObjectInspectorFactory.writableTimestampObjectInspector
          .create(new Timestamp(((LongWritable) o).get()));
    } else {
      ((TimestampWritable) o).set(bytes, offset);
    }
    return o;
  }

  public Object set(Object o, Timestamp t) {
    if (t == null) {
      return null;
    }
    if (o instanceof LongWritable) {
      o = PrimitiveObjectInspectorFactory.writableTimestampObjectInspector.create(t);
    } else {
      ((TimestampWritable) o).set(t);
    }
    return o;
  }

  public Object set(Object o, TimestampWritable t) {
    if (t == null) {
      return null;
    }
    if (o instanceof LongWritable) {
      o = PrimitiveObjectInspectorFactory.writableTimestampObjectInspector
          .create(new Timestamp(((LongWritable) o).get()));
    } else {
      ((TimestampWritable) o).set(t);
    }
    return o;
  }

  public Object create(byte[] bytes, int offset) {
    return new TimestampWritable(bytes, offset);
  }

  public Object create(Timestamp t) {
    return new TimestampWritable(t);
  }
}
```
I'm looking for a solution that doesn't require modifying the Hive source code. Let me know if you can come up with any good ideas.
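For reference, the miscast described above comes from the `timestamp-millis` logical type annotation in the Avro schema, which Avro stores as a plain `long`. A minimal sketch of such a field (the `event_time` name here is hypothetical, not from the PR) looks like:
```
{
  "name": "event_time",
  "type": {
    "type": "long",
    "logicalType": "timestamp-millis"
  }
}
```
Hive 2.3's parquet-avro reader hands the underlying long value to the object inspector without applying the logical type, which is why the patch above checks for `LongWritable` before falling back to the normal `TimestampWritable` path.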
cdmikechen commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532039500 @umehrot2 It is based on the 0.4.8 version. You also need to upgrade avro to 1.8.2 or higher (for logical type support), and parquet to 1.8.2 or higher. My current project is using hoodie 0.4.8. There are still some problems to be adjusted in my project, so I haven't made any improvements yet. By October, I will rebase the PR code onto version 0.5.0.
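As a sketch, the version upgrades mentioned above could be expressed in the Hudi `pom.xml` roughly like this (the property names are illustrative and may not match the actual pom; the versions are the minimums stated in this thread):
```
<properties>
  <!-- avro 1.8.2+ as discussed above, for logical type support -->
  <avro.version>1.8.2</avro.version>
  <!-- parquet 1.8.2+ so parquet-avro carries the logical type annotations -->
  <parquet.version>1.8.2</parquet.version>
</properties>
```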