date:20190731

[GitHub] [incubator-hudi] anchalkataria commented on issue #796: Error hive sync via delta streamer

2019-07-31 Thread GitBox

anchalkataria commented on issue #796: Error hive sync via delta streamer
URL: https://github.com/apache/incubator-hudi/issues/796#issuecomment-517127576
 
 
   > @anchalkataria we have some leads on the null issue. we expect it to be 
fixed on master soon..
   > 
   > on your original registration issue, I actually was able to register 
through delta streamer in the demo setup on master branch... Would you be able 
to give it a shot? I can give you commands..
   
   @vinothchandar So now I am not trying this on local anymore . I am directly 
running the tool on AWS Emr cluster and able to sync data in hive through 
DeltaStreamer.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI

2019-07-31 Thread GitBox

vinothchandar commented on issue #714: Performance Comparison of 
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-517125538
 
 
   @NetsanetGeb 2 comes from the configs you are setting? 
   hoodie.upsert.shuffle.parallellism & hoodie.insert.shuffle.parallellism? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #774: Matching question of the version in Spark and Hive2

2019-07-31 Thread GitBox

vinothchandar commented on issue #774: Matching question of the version in 
Spark and Hive2 
URL: https://github.com/apache/incubator-hudi/issues/774#issuecomment-517124662
 
 
   @cdmikechen can we have a call or can you write up how we can take a fresh 
look at the hive sync aspects? It definitely works in certain versions, but 
runs into snags like this with certain versions.. Its a pretty hairy issue IMO 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #789: Demo : Unexpected result in some queries

2019-07-31 Thread GitBox

vinothchandar commented on issue #789: Demo : Unexpected result in some queries
URL: https://github.com/apache/incubator-hudi/issues/789#issuecomment-517124343
 
 
   @n3nash is debugging the join issue, which seems different? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #796: Error hive sync via delta streamer

2019-07-31 Thread GitBox

vinothchandar commented on issue #796: Error hive sync via delta streamer
URL: https://github.com/apache/incubator-hudi/issues/796#issuecomment-517124229
 
 
   @n3nash can you paste the error you got hive syncing on the apache hive 2.x 
servers if any? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #796: Not able to use S3 as storage for Hudi dataset

2019-07-31 Thread GitBox

vinothchandar commented on issue #796: Not able to use S3 as storage for Hudi 
dataset
URL: https://github.com/apache/incubator-hudi/issues/796#issuecomment-517123959
 
 
   @anchalkataria we have some leads on the null issue. we expect it to be 
fixed on master soon.. 
   
   on your original registration issue, I actually was able to register through 
delta streamer in the demo setup on master branch... Would you be able to give 
it a shot? I can give you commands.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #800: Performance tuning

2019-07-31 Thread GitBox

vinothchandar commented on issue #800: Performance tuning
URL: https://github.com/apache/incubator-hudi/issues/800#issuecomment-517123657
 
 
   hi.. any updates? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #801: How to customize schema

2019-07-31 Thread GitBox

vinothchandar commented on issue #801: How to customize schema
URL: https://github.com/apache/incubator-hudi/issues/801#issuecomment-517123554
 
 
   Closing. Reopen new issues on JIRA or mailing list as needed


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar closed issue #801: How to customize schema

2019-07-31 Thread GitBox

vinothchandar closed issue #801: How to customize schema
URL: https://github.com/apache/incubator-hudi/issues/801
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen opened a new issue #817: spark-submit with userClassPathFirst config error

2019-07-31 Thread GitBox

cdmikechen opened a new issue #817: spark-submit with userClassPathFirst config 
error
URL: https://github.com/apache/incubator-hudi/issues/817
 
 
   When I used spark-submit to run some codes like that(spark 2.4.3 and scala 
2.11.12):
   ```
   ../bin/spark-submit --master yarn --class xxx.xxx.Main --conf 
spark.driver.userClassPathFirst=true --conf 
spark.executor.userClassPathFirst=true  --jars 
xxx/hoodie/hoodie-spark-bundle-0.4.8-SNAPSHOT.jar,xxx.jar  xxx/sparkserver.jar
   ```
   Spark reported such an exception when creating spark session to yarn
   ```
   19/08/01 08:49:29 ERROR 
org.apache.spark.network.server.TransportRequestHandler - Error while invoking 
RpcHandler#receive() for one-way message.
   java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.Map$Map2 to field 
org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$AddWebUIFilter.filterParams
 of type scala.collection.immutable.Map in instance of 
org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$AddWebUIFilter
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2293)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:271)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:320)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:270)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:269)
at 
org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:611)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:662)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:654)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at

[GitHub] [incubator-hudi] jackwang2 commented on issue #764: Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present

2019-07-31 Thread GitBox

jackwang2 commented on issue #764: Hoodie 0.4.7:  Error upserting bucketType 
UPDATE for partition #, No value present
URL: https://github.com/apache/incubator-hudi/issues/764#issuecomment-517089256
 
 
   @n3nash No, I didn't. The main logic is for just global deduplication, and
   code is pasted as below:
   
 df.dropDuplicates(recordKey)
   .write
   .format("com.uber.hoodie")
   .mode(SaveMode.Append)
   .option(HoodieWriteConfig.TABLE_NAME, tableName)
   .option(HoodieIndexConfig.INDEX_TYPE_PROP,
   HoodieIndex.IndexType.GLOBAL_BLOOM.name)
   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKey)
   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,
   partitionCol)
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
   DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, storageType)
   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, preCombineCol)
   .option("hoodie.consistency.check.enabled", "true")
   .option("hoodie.parquet.small.file.limit", 1024 * 1024 * 128)
   .save(tgtFilePath)
   
   Thanks,
   Jack
   
   On Thu, Aug 1, 2019 at 9:01 AM n3nash  wrote:
   
   > It looks like the "Not an Avro data file" exception is thrown when there
   > is a 0 byte stream read into the datafilereader as can be seen here :
   > 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L55
   > and here :
   > 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileConstants.java#L29
   >
   > From the stack trace (by tracing the line numbers), it looks like the
   > CLEAN file is failing to be archived. I looked at the clean logic and we do
   > create clean files even when we don't have anything to clean but that does
   > not result in a 0 bytes file, it still has some valid avro data. I'm
   > wondering if this has anything to do with any sort of race condition
   > leading to archiving running when clean is a 0 sized file.
   >
   > @jackwang2  How are you running the cleaner
   > and the archival process ? Are you explicitly doing anything there ?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > 
,
   > or mute the thread
   > 

   > .
   >
   
   
   -- 
   [image: vshapesaqua11553186012.gif]    *Jianbin Wang*
   Sr. Engineer II, Data
   +86 18633600964
   
   [image: in1552694272.png] 
[image:
   fb1552694203.png]   [image:
   tw1552694330.png]   [image:
   ig1552694392.png] 
   Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #764: Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present

2019-07-31 Thread GitBox

n3nash commented on issue #764: Hoodie 0.4.7:  Error upserting bucketType 
UPDATE for partition #, No value present
URL: https://github.com/apache/incubator-hudi/issues/764#issuecomment-517076737
 
 
   It looks like the "Not an Avro data file" exception is thrown when there is 
a 0 byte stream read into the datafilereader as can be seen here : 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L55
 and here : 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileConstants.java#L29
   
   From the stack trace (by tracing the line numbers), it looks like the CLEAN 
file is failing to be archived. I looked at the clean logic and we do create 
clean files even when we don't have anything to clean but that does not result 
in a 0 bytes file, it still has some valid avro data. I'm wondering if this has 
anything to do with any sort of race condition leading to archiving running 
when clean is a 0 sized file.
   
   @jackwang2 How are you running the cleaner and the archival process ? Are 
you explicitly doing anything there ?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash edited a comment on issue #764: Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present

2019-07-31 Thread GitBox

n3nash edited a comment on issue #764: Hoodie 0.4.7:  Error upserting 
bucketType UPDATE for partition #, No value present
URL: https://github.com/apache/incubator-hudi/issues/764#issuecomment-517076737
 
 
   It looks like the "Not an Avro data file" exception is thrown when there is 
a 0 byte stream read into the datafilereader as can be seen here : 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L55
 and here : 
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileConstants.java#L29
   
   From the stack trace (by tracing the line numbers), it looks like the CLEAN 
file is failing to be archived. I looked at the clean logic and we do create 
clean files even when we don't have anything to clean but that does not result 
in a 0 bytes file, it still has some valid avro data. Although we need to fix 
not creating a clean file when there is nothing to clean, this still doesn't 
result into the error. I'm wondering if this has anything to do with any sort 
of race condition leading to archiving running when clean is a 0 sized file.
   
   @jackwang2 How are you running the cleaner and the archival process ? Are 
you explicitly doing anything there ?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] tweise commented on issue #816: [HUDI-121] Add lresende signing key to KEYS file

2019-07-31 Thread GitBox

tweise commented on issue #816: [HUDI-121] Add lresende signing key to KEYS file
URL: https://github.com/apache/incubator-hudi/pull/816#issuecomment-517066426
 
 
   The KEYS file needs to be added to the dist area, not here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #816: [HUDI-121] Add lresende signing key to KEYS file

2019-07-31 Thread GitBox

bvaradar commented on a change in pull request #816: [HUDI-121] Add lresende 
signing key to KEYS file
URL: https://github.com/apache/incubator-hudi/pull/816#discussion_r309469510
 
 

 ##
 File path: KEYS
 ##
 @@ -126,3 +126,286 @@ 
txTq7YpleWQhcz9+9Fruu7jA+l1pSUJSR0+DZegBOq+zWIHcZSTbAnfOX+jYySYd
 lsw/
 =GJFW
 -END PGP PUBLIC KEY BLOCK-
+pub   dsa1024 2007-06-30 [SC]
+  50D7C82AC19334EA3C75699AF39F187DEFB55DF1
+uid   [ unknown] Luciano Resende (Code Signing Key) 

+sig 3F39F187DEFB55DF1 2007-06-30  Luciano Resende (Code Signing Key) 

+sig 2F4268EE7F2EFD0F0 2010-11-05  Christopher David Schultz 
(Christopher David Schultz) 
 
 Review comment:
   @lresende : Not sure if sig entries for other users must be present. Can  
you remove them if not needed ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lresende opened a new pull request #816: [HUDI-121] Add lresende signing key to KEYS file

2019-07-31 Thread GitBox

lresende opened a new pull request #816: [HUDI-121] Add lresende signing key to 
KEYS file
URL: https://github.com/apache/incubator-hudi/pull/816
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

svn commit: r35085 - /dev/incubator/hudi/

2019-07-31 Thread lresende

Author: lresende
Date: Wed Jul 31 22:45:37 2019
New Revision: 35085

Log:
Adding release staging directory for Hudi

Added:
dev/incubator/hudi/

svn commit: r35084 - /release/incubator/hudi/

2019-07-31 Thread lresende

Author: lresende
Date: Wed Jul 31 22:44:48 2019
New Revision: 35084

Log:
Adding release directory for Hudi

Added:
release/incubator/hudi/

[GitHub] [incubator-hudi] vinothchandar closed pull request #803: [WIP] [Not For Merging] Demo automation with pom dep order fixes from PR-780

2019-07-31 Thread GitBox

vinothchandar closed pull request #803: [WIP] [Not For Merging] Demo automation 
with pom dep order fixes from PR-780
URL: https://github.com/apache/incubator-hudi/pull/803
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #803: [WIP] [Not For Merging] Demo automation with pom dep order fixes from PR-780

2019-07-31 Thread GitBox

vinothchandar commented on issue #803: [WIP] [Not For Merging] Demo automation 
with pom dep order fixes from PR-780
URL: https://github.com/apache/incubator-hudi/pull/803#issuecomment-517043325
 
 
   have this code and #780 both testing in `pom-bundle-cleanup` branch.. 
Closing this. Will open a new one when ready. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #815: HUDI-186 Fix formatting for new content in Writing Data page. Update website to reflect new apache links

2019-07-31 Thread GitBox

vinothchandar commented on issue #815: HUDI-186 Fix formatting for new content 
in Writing Data page. Update website to reflect new apache links
URL: https://github.com/apache/incubator-hudi/pull/815#issuecomment-517042646
 
 
   No. have not seen them.. its auto generated content, so may be it reflects 
the localhost name or ip (0.0.0.0).. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar merged pull request #815: HUDI-186 Fix formatting for new content in Writing Data page. Update website to reflect new apache links

2019-07-31 Thread GitBox

vinothchandar merged pull request #815: HUDI-186 Fix formatting for new content 
in Writing Data page. Update website to reflect new apache links
URL: https://github.com/apache/incubator-hudi/pull/815
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[incubator-hudi] branch asf-site updated: Fix formatting for new content in Writing Data page. Update hudi.incubator.apache.org website (#815)

2019-07-31 Thread vinoth

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 3190d6d  Fix formatting for new content in Writing Data page. Update 
hudi.incubator.apache.org website (#815)
3190d6d is described below

commit 3190d6d59f2292f265e5bdbf6cffb40264c5252d
Author: Balaji Varadarajan 
AuthorDate: Wed Jul 31 15:15:00 2019 -0700

Fix formatting for new content in Writing Data page. Update 
hudi.incubator.apache.org website (#815)
---
 content/404.html | 13 +-
 content/admin_guide.html | 13 +-
 content/community.html   | 13 +-
 content/comparison.html  | 13 +-
 content/concepts.html| 13 +-
 content/configurations.html  | 13 +-
 content/contributing.html| 13 +-
 content/docker_demo.html | 13 +-
 content/feed.xml | 20 
 content/gcs_hoodie.html  | 13 +-
 content/index.html   | 13 +-
 content/js/mydoc_scroll.html | 13 +-
 content/migration_guide.html | 13 +-
 content/news.html| 13 +-
 content/news_archive.html| 13 +-
 content/performance.html | 13 +-
 content/powered_by.html  | 13 +-
 content/privacy.html | 13 +-
 content/querying_data.html   | 13 +-
 content/quickstart.html  | 13 +-
 content/s3_hoodie.html   | 13 +-
 content/sitemap.xml  | 48 ++---
 content/strata-talk.html | 15 ++--
 content/use_cases.html   | 13 +-
 content/writing_data.html| 56 ++--
 docs/writing_data.md | 14 ++-
 26 files changed, 356 insertions(+), 70 deletions(-)

diff --git a/content/404.html b/content/404.html
index 07dc2e8..dd7c740 100644
--- a/content/404.html
+++ b/content/404.html
@@ -46,7 +46,7 @@
 https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js";>

[GitHub] [incubator-hudi] n3nash commented on issue #814: Fix for realtime queries

2019-07-31 Thread GitBox

n3nash commented on issue #814: Fix for realtime queries
URL: https://github.com/apache/incubator-hudi/pull/814#issuecomment-517026353
 
 
   On another note, I debugged the issue with join queries as reported here : 
https://github.com/apache/incubator-hudi/issues/789 and found weird results 
(nothing to do with the change in this PR or the hive on spark fix). 
Essentially, looks like due to some hive join optimizations, only 1 table 
format gets picked up (either hoodierealtimeinputformat or hoodieinputformat) 
when joining 2 tables with different table formats. May be some bug on our end, 
have to dig deeper and will open a different ticket around it. This is just FYI


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #814: Fix for realtime queries

2019-07-31 Thread GitBox

n3nash commented on issue #814: Fix for realtime queries
URL: https://github.com/apache/incubator-hudi/pull/814#issuecomment-517025307

The code being removed was added to make Hive on Spark work. Due to a bug in
Hive, Hive on Spark does not work seamlessly with RT tables.

> Issue with Hive on Spark (that was fixed by caching some information) :

Hive on Spark allows for multiple tasks to run in the same executor. Since
the executor/JVM' lifetime is longer than 1 task, the job conf variable is
shared across different file splits. Due to a bug in hive (find in the comments
of HoodieRealtimeInputFormat class), the columnids and columnnames are messed
up. The same columnids and names are added multiple times to the same key in
job conf. As a workaround this : https://issues.apache.org/jira/browse/HUDI-151
was added. But this leads to some other issues which results in breaking the RT
queries.

> Current issue :

Ideally, a single query in Hive either starts a MapReduce job or a Spark
job. Once the query finishes, the mapper/reduces or the spark tasks die. As a
result, this caching of column_ids and column_names does not carry across
different queries. Hence, this works fine in production environments at the
moment.
In the case of the demo, it seems like this cache is somehow kept across
different queries.

> Steps to reproduce the issue :

_Execute the following sequence in step 4(a)_

`0: jdbc:hive2://hiveserver:1> select symbol, max(ts) from
stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';`

When this is done, the COLUMN_NAMES and COLUMN_IDS for this query is cached
as follows :

`
COLUMN_NAMES ==>
ts,symbol,_hoodie_record_key,_hoodie_commit_time,_hoodie_partition_path
COLUMN_IDS ==> 6,7,2,0,3
`

_Now run the second query :_

`0: jdbc:hive2://hiveserver:1> select `_hoodie_commit_time`, symbol, ts,
volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';`

At this time, although the projection cols are as follows :
`
Projection Column Names =>
_hoodie_commit_time,volume,ts,symbol,close,open,_hoodie_record_key,_hoodie_partition_path
Projection Column Ids => 0,5,6,7,14,15,2,3
`
due to the fact that we cache these values (to fix hive on spark), these
values are replaced with the earlier values (see
[here](https://github.com/apache/incubator-hudi/blob/master/hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/realtime/HoodieRealtimeInputFormat.java#L243))
:
`
COLUMN_NAMES ==>
ts,symbol,_hoodie_record_key,_hoodie_commit_time,_hoodie_partition_path
COLUMN_IDS ==> 6,7,2,0,3
`
Notice that the column names and ids for columns volume, open & close are
omitted (since they were not part of the first query and hence the cached
values don't have it). Hence these columns are never read/projected and return
NULL.

This happens intermittently and cannot be reproduced everytime. My suspicion
is that it has to do with which datanode the query runs in and if the datanode
caches the job conf but I'm not very sure about this.

In any case, I'm reverting the change made to have Hive on Spark work. This
means Hive on Spark queries will be broken in RT (for some specific types of
queries, not all). Since the docker image does not have a way to debug hive on
spark queries, I'm figuring out an environment where I can do this (the
internal environment that i was using earlier is broken) after which I will
find a permanent fix that will not break RT queries and also make hive on spark
run.

@vinothchandar
@bvaradar
@bhasudha
FYI

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar commented on issue #815: HUDI-186 Fix formatting for new content in Writing Data page. Update website to reflect new apache links

2019-07-31 Thread GitBox

bvaradar commented on issue #815: HUDI-186 Fix formatting for new content in 
Writing Data page. Update website to reflect new apache links
URL: https://github.com/apache/incubator-hudi/pull/815#issuecomment-517022364
 
 
   @vinothchandar @n3nash : Fixed some doc formatting in Writing Data page and 
updating the website to make Whimsy website check green. 
   
   I see some additional changes replacing 0.0.0.0 with localhost. Have you 
seen this before. Guess this is due to version change in the tool - bundle.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar opened a new pull request #815: HUDI-186 Fix formatting for new content in Writing Data page. Update website to reflect new apache links

2019-07-31 Thread GitBox

bvaradar opened a new pull request #815: HUDI-186 Fix formatting for new 
content in Writing Data page. Update website to reflect new apache links
URL: https://github.com/apache/incubator-hudi/pull/815
 
 
   HUDI-186 Fix formatting for new content in Writing Data page. Update website 
to reflect new apache links


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash opened a new pull request #814: Fix for realtime queries

2019-07-31 Thread GitBox

n3nash opened a new pull request #814: Fix for realtime queries
URL: https://github.com/apache/incubator-hudi/pull/814
 
 
   - Fix realtime queries by removing COLUMN_ID and COLUMN_NAME cache in 
inputformat
   - These variables were cached to make Hive on Spark work


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[incubator-hudi] branch asf-site updated: HUDI-186 : Add missing Apache Links in hudi site

2019-07-31 Thread nagarwal

This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 018802c  HUDI-186 : Add missing Apache Links in hudi site
018802c is described below

commit 018802ca7a2f97a267a4b8a99b0e97bfa6444362
Author: Balaji Varadarajan 
AuthorDate: Wed Jul 31 10:50:30 2019 -0700

HUDI-186 : Add missing Apache Links in hudi site
---
 docs/Gemfile.lock  |  2 +-
 docs/_includes/footer.html | 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs/Gemfile.lock b/docs/Gemfile.lock
index b72b9b1..acdd9db 100644
--- a/docs/Gemfile.lock
+++ b/docs/Gemfile.lock
@@ -153,4 +153,4 @@ DEPENDENCIES
   jekyll-feed (~> 0.6)
 
 BUNDLED WITH
-   1.14.3
+   2.0.1
diff --git a/docs/_includes/footer.html b/docs/_includes/footer.html
index ed02cf6..d1a77c0 100755
--- a/docs/_includes/footer.html
+++ b/docs/_includes/footer.html
@@ -15,5 +15,16 @@
   reflection of the completeness or stability of the code, it 
does indicate that the project has yet to be fully endorsed by the ASF.
   
 
+
+  
+https://incubator.apache.org/;> Apache Incubator 
+https://www.apache.org/;> About the ASF 
+https://www.apache.org/events/current-event;> Events 
+https://www.apache.org/foundation/thanks.html;> Thanks 
+https://www.apache.org/foundation/sponsorship.html;> Become a Sponsor 
+https://www.apache.org/security/;> Security 
+https://www.apache.org/licenses/;> License 
+  
+

[GitHub] [incubator-hudi] n3nash merged pull request #813: HUDI-186 : Add missing Apache Links in hudi site

2019-07-31 Thread GitBox

n3nash merged pull request #813: HUDI-186 : Add missing Apache Links in hudi 
site
URL: https://github.com/apache/incubator-hudi/pull/813
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar commented on issue #813: HUDI-186 : Add missing Apache Links in hudi site

2019-07-31 Thread GitBox

bvaradar commented on issue #813: HUDI-186 : Add missing Apache Links in hudi 
site
URL: https://github.com/apache/incubator-hudi/pull/813#issuecomment-517005269
 
 
   @vinothchandar @n3nash : Checked by running the website locally. Needed for 
making website checks all green. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar opened a new pull request #813: HUDI-186 : Add missing Apache Links in hudi site

2019-07-31 Thread GitBox

bvaradar opened a new pull request #813: HUDI-186 : Add missing Apache Links in 
hudi site
URL: https://github.com/apache/incubator-hudi/pull/813
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #812: KryoException: Unable to find class

2019-07-31 Thread GitBox

n3nash commented on issue #812: KryoException: Unable to find class
URL: https://github.com/apache/incubator-hudi/issues/812#issuecomment-516965553
 
 
   This looks more like a spark issue. So whenever spark shuffles data, if you 
choose kryo for serialization, one has to register java objects with kryo under 
a name, it looks like kryo is unable to find that object under the name and 
hence throws class not found. Does your application change over time in any way 
? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] NetsanetGeb edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI

2019-07-31 Thread GitBox

NetsanetGeb edited a comment on issue #714: Performance Comparison of
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-516753477

After i used hoodie 0.4.6 version, the performance improved and now its
taking 4 minutes.

![per2](https://user-images.githubusercontent.com/25975892/62195353-1c158600-b37c-11e9-905f-0b04213e614f.png)

I also added a similar code to the countByKey for counting the records in
the HoodieDeltaStreamer class and check why its taking long in the
HoodieBloomIndex and it took about 9 seconds. While the countByKey of the
HoodieBloomIndex is still taking 39 seconds. This change seems to occur due
to parallelism because on the first countByKey it have 22 and on the
HoodieBloomIndex its 2 as observed from the Spark UI below.

![per1](https://user-images.githubusercontent.com/25975892/62196336-1caf1c00-b37e-11e9-89f1-894387485ec7.png)

The effect is clearly seen as we increase the size of the input data from 2
GB to 27 GB. For stage 2, 3, and 4, it was using the 90 executors as provided
and decreases it accordingly. While for stage 5, only 2 executors were running
from the start.

![per3](https://user-images.githubusercontent.com/25975892/62214909-3f552b00-b3a6-11e9-92b5-df197378795d.png)

How do we enhance the parallelism of the bloom index since hoodie is
calculating the parallelism for bloom index inside without the need to set it
as a configuration?
In general, are there specific ways to enhance the performance of bloom
indexing?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services

[GitHub] [incubator-hudi] NetsanetGeb edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI

2019-07-31 Thread GitBox

NetsanetGeb edited a comment on issue #714: Performance Comparison of
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-516753477

After i used hoodie 0.4.6 version, the performance improved and now its
taking 4 minutes.

![per2](https://user-images.githubusercontent.com/25975892/62195353-1c158600-b37c-11e9-905f-0b04213e614f.png)

I also added a similar code to the countByKey for counting the records in
the HoodieDeltaStreamer class and check why its taking long in the
HoodieBloomIndex and it took about 9 seconds. While the countByKey of the
HoodieBloomIndex is still taking 39 seconds. This change seems to occur due
to parallelism because on the first count it have 22 and on the HoodieBloom
index its 2 as observed from the Spark UI below.