[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content

2019-03-21 Thread GitBox
n3nash commented on a change in pull request #610: Major cleanup of docs 
structure/content
URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268040070
 
 

 ##
 File path: docs/concepts.md
 ##
 @@ -24,8 +29,14 @@ Such key activities include
  MergeOnRead storage type of dataset
  * `COMPACTIONS` - Background activity to reconcile differential data 
structures within Hudi e.g: moving updates from row based log files to columnar 
formats.
 
+Any given instant can be 
+in one of the following states
+
+ * `REQUESTED` - Denotes an action has been scheduled, but has not begun yet
 
 Review comment:
   nit: s/but has not begun yet/but has not been initiated/


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content

2019-03-21 Thread GitBox
n3nash commented on a change in pull request #610: Major cleanup of docs 
structure/content
URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268039634
 
 

 ##
 File path: docs/concepts.md
 ##
 @@ -7,15 +7,20 @@ toc: false
 summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following primitives over 
datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over datasets on DFS
 
  * Upsert (how do I change the dataset?)
- * Incremental consumption(how do I fetch data that changed?)
+ * Incremental pull   (how do I fetch data that changed?)
 
+In this section, we will discuss key concepts & terminologies that are 
important to understand, to be able to effectively use these primitives.
 
-In order to achieve this, Hudi maintains a `timeline` of all activity 
performed on the dataset, that helps provide `instantaenous` views of the 
dataset,
-while also efficiently supporting retrieval of data in the order of arrival 
into the dataset.
-Such key activities include
+## Timeline
+At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time that helps provide instantaenous views 
of the dataset,
 
 Review comment:
   'At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time. This helps provide instantaneous views 
of the dataset while also supporting efficient retrieval of changed data in the 
order of arrival'. A Hudi instant is uniquely defined by 3 components, namely 
the action type, the instant time at which it started, and its current state...
   Note the spelling of instantaneous as well.
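
To make the three components named in this comment concrete, here is a minimal Java sketch of an instant as a value type. The class name `InstantSketch`, its fields, and the `INFLIGHT` and `COMPLETED` states are assumptions for illustration; only `REQUESTED` appears in the diff under review, and this is not Hudi's actual class.

```java
import java.util.Objects;

// Hypothetical sketch, not Hudi's actual class: an instant identified by
// (action type, instant time, state).
final class InstantSketch {
  // REQUESTED comes from the docs diff; INFLIGHT and COMPLETED are assumed
  // here only to give the example more than one state.
  enum State { REQUESTED, INFLIGHT, COMPLETED }

  final String action;       // e.g. "compaction"
  final String instantTime;  // time at which the action started (example format)
  final State state;

  InstantSketch(String action, String instantTime, State state) {
    this.action = Objects.requireNonNull(action, "action");
    this.instantTime = Objects.requireNonNull(instantTime, "instantTime");
    this.state = Objects.requireNonNull(state, "state");
  }

  // Same action and start time, different state.
  InstantSketch withState(State next) {
    return new InstantSketch(action, instantTime, next);
  }

  // Two instants are equal only if all three components match.
  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof InstantSketch)) return false;
    InstantSketch other = (InstantSketch) o;
    return action.equals(other.action)
        && instantTime.equals(other.instantTime)
        && state == other.state;
  }

  @Override
  public int hashCode() {
    return Objects.hash(action, instantTime, state);
  }

  @Override
  public String toString() {
    return action + "@" + instantTime + " [" + state + "]";
  }

  public static void main(String[] args) {
    InstantSketch requested =
        new InstantSketch("compaction", "20190321090336", State.REQUESTED);
    System.out.println(requested);
    System.out.println(requested.withState(State.COMPLETED));
  }
}
```

Treating the state as part of the identity means a requested and a completed compaction at the same instant time compare as different instants, which is one way to read "uniquely defined by 3 components".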


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content

2019-03-21 Thread GitBox
n3nash commented on a change in pull request #610: Major cleanup of docs 
structure/content
URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268039634
 
 

 ##
 File path: docs/concepts.md
 ##
 @@ -7,15 +7,20 @@ toc: false
 summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following primitives over 
datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over datasets on DFS
 
  * Upsert (how do I change the dataset?)
- * Incremental consumption(how do I fetch data that changed?)
+ * Incremental pull   (how do I fetch data that changed?)
 
+In this section, we will discuss key concepts & terminologies that are 
important to understand, to be able to effectively use these primitives.
 
-In order to achieve this, Hudi maintains a `timeline` of all activity 
performed on the dataset, that helps provide `instantaenous` views of the 
dataset,
-while also efficiently supporting retrieval of data in the order of arrival 
into the dataset.
-Such key activities include
+## Timeline
+At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time that helps provide instantaenous views 
of the dataset,
 
 Review comment:
   'At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time. This helps provide instantaneous views 
of the dataset'.
   Note the spelling of instantaneous as well.


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #610: Major cleanup of docs structure/content

2019-03-21 Thread GitBox
n3nash commented on a change in pull request #610: Major cleanup of docs 
structure/content
URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268039634
 
 

 ##
 File path: docs/concepts.md
 ##
 @@ -7,15 +7,20 @@ toc: false
 summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following primitives over 
datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over datasets on DFS
 
  * Upsert (how do I change the dataset?)
- * Incremental consumption(how do I fetch data that changed?)
+ * Incremental pull   (how do I fetch data that changed?)
 
+In this section, we will discuss key concepts & terminologies that are 
important to understand, to be able to effectively use these primitives.
 
-In order to achieve this, Hudi maintains a `timeline` of all activity 
performed on the dataset, that helps provide `instantaenous` views of the 
dataset,
-while also efficiently supporting retrieval of data in the order of arrival 
into the dataset.
-Such key activities include
+## Timeline
+At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time that helps provide instantaenous views 
of the dataset,
 
 Review comment:
   'At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time. This helps provide instantaneous views 
of the dataset while..'.
   Note the spelling of instantaneous as well.


[GitHub] [incubator-hudi] ambition119 commented on a change in pull request #608: [HUDI-63] Removed unused BucketedIndex code

2019-03-21 Thread GitBox
ambition119 commented on a change in pull request #608: [HUDI-63] Removed 
unused BucketedIndex code
URL: https://github.com/apache/incubator-hudi/pull/608#discussion_r268031907
 
 

 ##
 File path: 
hoodie-common/src/main/java/com/uber/hoodie/common/table/log/HoodieLogFormatVersion.java
 ##
 @@ -80,18 +80,15 @@ public boolean hasHeader() {
 
   @Override
   public boolean hasFooter() {
-    switch (super.getVersion()) {
-      case DEFAULT_VERSION:
-        return false;
-      case 1:
-        return true;
-      default:
-        return false;
-    }
+    return hasLogFooterAndBlockLength();
 
 Review comment:
   I made these optimizations while reading through and learning the code. If 
they are worth keeping, can I submit them in a separate PR?
   
   thanks
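
The switch removed in this hunk returns `true` only for version 1, so it can collapse to a single expression. The sketch below checks that equivalence. The body of `hasLogFooterAndBlockLength()` is not shown in the diff, so the class `LogFormatVersionSketch`, the method names, and the value of `DEFAULT_VERSION` are assumptions for illustration only.

```java
// Hypothetical stand-in for the class touched by the diff; not Hudi's code.
final class LogFormatVersionSketch {
  static final int DEFAULT_VERSION = 0; // assumed value, not shown in the diff

  private final int version;

  LogFormatVersionSketch(int version) {
    this.version = version;
  }

  // The switch from the removed lines, reproduced as-is.
  boolean hasFooterOld() {
    switch (version) {
      case DEFAULT_VERSION:
        return false;
      case 1:
        return true;
      default:
        return false;
    }
  }

  // Single-expression equivalent: only version 1 carries a footer.
  boolean hasFooterNew() {
    return version == 1;
  }

  public static void main(String[] args) {
    // Print both results side by side for a few versions.
    for (int v = -1; v <= 3; v++) {
      LogFormatVersionSketch s = new LogFormatVersionSketch(v);
      System.out.println(v + " -> " + s.hasFooterOld() + " / " + s.hasFooterNew());
    }
  }
}
```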


[GitHub] [incubator-hudi] ambition119 commented on issue #603: [HUDI-63] Removed unused BucketedIndex code

2019-03-21 Thread GitBox
ambition119 commented on issue #603: [HUDI-63] Removed unused BucketedIndex code
URL: https://github.com/apache/incubator-hudi/pull/603#issuecomment-475485921
 
 
   > Hmmm I usually do the following workflow.
   > 
   > ```
   > git fetch origin
   > git fetch hudi # replace with whatever is your remote pointing to apache/incubator-hudi
   > git checkout master
   > git rebase hudi/master
   > git checkout your-branch
   > git rebase master # resolve conflicts if any, run tests again
   > git push origin your-branch --force # this updates the PR
   > ```
   > @ambition119 does that help
   
   For almost the same operation, the commands I use are as follows:
   ```shell
   $ git remote add apache https://github.com/apache/incubator-hudi.git
   $ git fetch apache
   $ git rebase apache/master
   $ git push # or: git push origin master
   $ git checkout -b hudi_63
   ```
   Then I develop on the hudi_63 branch and open a pull request to merge it. Thank you!


[GitHub] [incubator-hudi] ambition119 commented on a change in pull request #608: [HUDI-63] Removed unused BucketedIndex code

2019-03-21 Thread GitBox
ambition119 commented on a change in pull request #608: [HUDI-63] Removed 
unused BucketedIndex code
URL: https://github.com/apache/incubator-hudi/pull/608#discussion_r268030230
 
 

 ##
 File path: 
hoodie-common/src/main/java/com/uber/hoodie/common/table/log/HoodieLogFormatVersion.java
 ##
 @@ -80,18 +80,15 @@ public boolean hasHeader() {
 
   @Override
   public boolean hasFooter() {
-    switch (super.getVersion()) {
-      case DEFAULT_VERSION:
-        return false;
-      case 1:
-        return true;
-      default:
-        return false;
-    }
+    return hasLogFooterAndBlockLength();
 
 Review comment:
   > while I am in favor of cleanups, can we keep this PR to just removing 
BucketedIndex related changes alone? we can open a JIRA for these fixes.. wdyt?
   
   ok


[GitHub] [incubator-hudi] ambition119 commented on a change in pull request #608: [HUDI-63] Removed unused BucketedIndex code

2019-03-21 Thread GitBox
ambition119 commented on a change in pull request #608: [HUDI-63] Removed 
unused BucketedIndex code
URL: https://github.com/apache/incubator-hudi/pull/608#discussion_r268030187
 
 

 ##
 File path: 
hoodie-common/src/main/java/com/uber/hoodie/common/table/log/block/HoodieAvroDataBlock.java
 ##
 @@ -115,26 +115,7 @@ public static HoodieLogBlock getBlock(HoodieLogFile logFile,
 
     // 3. Write the records
     Iterator itr = records.iterator();
-    while (itr.hasNext()) {
-      IndexedRecord s = itr.next();
-      ByteArrayOutputStream temp = new ByteArrayOutputStream();
-      Encoder encoder = EncoderFactory.get().binaryEncoder(temp, null);
-      try {
-        // Encode the record into bytes
-        writer.write(s, encoder);
-        encoder.flush();
-
-        // Get the size of the bytes
-        int size = temp.toByteArray().length;
-        // Write the record size
-        output.writeInt(size);
-        // Write the content
-        output.write(temp.toByteArray());
-        itr.remove();
-      } catch (IOException e) {
-        throw new HoodieIOException("IOException converting HoodieAvroDataBlock to bytes", e);
-      }
-    }
+    writerIndexedRecord(writer, output, itr);
 
 Review comment:
   > same here.. please just limit this PR to BucketedIndex removal
   
   OK, I will fix this.
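
The loop removed in this hunk writes each record as a length prefix followed by its bytes. Below is a minimal, standard-library-only sketch of that framing, with plain byte arrays standing in for Avro-encoded records. `LengthPrefixedSketch` and its methods are hypothetical names; the body of `writerIndexedRecord` is not shown in the diff, so this is not a reconstruction of it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: byte arrays stand in for Avro-encoded records.
final class LengthPrefixedSketch {

  // Writes each record as [4-byte size][payload] and drains the iterator,
  // mirroring the shape of the loop removed in the diff.
  static void writeRecords(Iterator<byte[]> itr, DataOutputStream output)
      throws IOException {
    while (itr.hasNext()) {
      byte[] bytes = itr.next();      // payload is materialized once
      output.writeInt(bytes.length);  // write the record size
      output.write(bytes);            // write the content
      itr.remove();                   // drop the record once it is written
    }
  }

  // Encodes all records into one block; the list must support Iterator.remove().
  static byte[] encode(List<byte[]> records) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream output = new DataOutputStream(buffer)) {
      writeRecords(records.iterator(), output);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // Reads the block back by consuming size-prefixed payloads until exhausted.
  static List<byte[]> decode(byte[] block) {
    List<byte[]> records = new ArrayList<>();
    try (DataInputStream input =
        new DataInputStream(new ByteArrayInputStream(block))) {
      while (input.available() > 0) {
        byte[] bytes = new byte[input.readInt()];
        input.readFully(bytes);
        records.add(bytes);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return records;
  }

  public static void main(String[] args) {
    List<byte[]> records = new ArrayList<>(Arrays.asList(
        "a".getBytes(StandardCharsets.UTF_8),
        "bcd".getBytes(StandardCharsets.UTF_8)));
    byte[] block = encode(records);
    for (byte[] r : decode(block)) {
      System.out.println(new String(r, StandardCharsets.UTF_8));
    }
  }
}
```

One detail the sketch changes on purpose: the removed loop calls `temp.toByteArray()` twice per record, and each call copies the buffer, whereas here the payload array is obtained once and reused for both the size and the content.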


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #608: [HUDI-63] Removed unused BucketedIndex code

2019-03-21 Thread GitBox
vinothchandar commented on a change in pull request #608: [HUDI-63] Removed 
unused BucketedIndex code
URL: https://github.com/apache/incubator-hudi/pull/608#discussion_r268021296
 
 

 ##
 File path: 
hoodie-common/src/main/java/com/uber/hoodie/common/table/log/HoodieLogFormatVersion.java
 ##
 @@ -80,18 +80,15 @@ public boolean hasHeader() {
 
   @Override
   public boolean hasFooter() {
-    switch (super.getVersion()) {
-      case DEFAULT_VERSION:
-        return false;
-      case 1:
-        return true;
-      default:
-        return false;
-    }
+    return hasLogFooterAndBlockLength();
 
 Review comment:
   while I am in favor of cleanups, can we keep this PR to just removing 
BucketedIndex related changes alone? we can open a JIRA for these fixes.. wdyt? 


[GitHub] [incubator-hudi] vinothchandar commented on issue #603: [HUDI-63] Removed unused BucketedIndex code

2019-03-21 Thread GitBox
vinothchandar commented on issue #603: [HUDI-63] Removed unused BucketedIndex 
code
URL: https://github.com/apache/incubator-hudi/pull/603#issuecomment-475472507
 
 
   Hmmm I usually do the following workflow. 
   
   ```
   git fetch origin
   git fetch hudi # replace with whatever is your remote pointing to apache/incubator-hudi
   git checkout master
   git rebase hudi/master
   git checkout your-branch
   git rebase master # resolve conflicts if any, run tests again
   git push origin your-branch --force # this updates the PR
   ```
   
   @ambition119 does that help 


[GitHub] [incubator-hudi] tooptoop4 edited a comment on issue #502: Support "analyze table x compute statistics [for columns]" in Hive

2019-03-21 Thread GitBox
tooptoop4 edited a comment on issue #502: Support "analyze table x compute 
statistics [for columns]" in Hive
URL: https://github.com/apache/incubator-hudi/issues/502#issuecomment-475437951
 
 
   @vinothchandar can you share the patch PR? is it related to 
https://issues.apache.org/jira/browse/HIVE-11266 ?


[GitHub] [incubator-hudi] tooptoop4 commented on issue #502: Support "analyze table x compute statistics [for columns]" in Hive

2019-03-21 Thread GitBox
tooptoop4 commented on issue #502: Support "analyze table x compute statistics 
[for columns]" in Hive
URL: https://github.com/apache/incubator-hudi/issues/502#issuecomment-475437951
 
 
   @vinothchandar can you share the patch PR?


[GitHub] [incubator-hudi] vinothchandar commented on issue #143: Tracking ticket for folks to be added to slack group

2019-03-21 Thread GitBox
vinothchandar commented on issue #143: Tracking ticket for folks to be added to 
slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-475317483
 
 
   @Zhujun-Vungle Done.. Also please join our mailing list, where most of the 
conversations happen these days. https://hudi.apache.org/community.html :) 


[incubator-hudi] branch master updated: run_hive_sync tool must be able to handle case where there are multiple standalone jdbc jars in hive installation dir

2019-03-21 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 194d904  run_hive_sync tool must be able to handle case where there 
are multiple standalone jdbc jars in hive installation dir
194d904 is described below

commit 194d904c99ebd013af55eac7509e3e79193dce77
Author: Balaji Varadarajan 
AuthorDate: Thu Mar 21 09:03:36 2019 -0700

run_hive_sync tool must be able to handle case where there are multiple 
standalone jdbc jars in hive installation dir
---
 hoodie-hive/run_sync_tool.sh | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hoodie-hive/run_sync_tool.sh b/hoodie-hive/run_sync_tool.sh
index 910b4c5..1adae9e 100755
--- a/hoodie-hive/run_sync_tool.sh
+++ b/hoodie-hive/run_sync_tool.sh
@@ -23,13 +23,13 @@ if [ -z "$HADOOP_CONF_DIR" ]; then
 fi
 
 ## Include only specific packages from HIVE_HOME/lib to avoid version mismatches
-HIVE_EXEC=`ls ${HIVE_HOME}/lib/hive-exec-*.jar`
-HIVE_SERVICE=`ls ${HIVE_HOME}/lib/hive-service-*.jar | grep -v rpc`
-HIVE_METASTORE=`ls ${HIVE_HOME}/lib/hive-metastore-*.jar`
+HIVE_EXEC=`ls ${HIVE_HOME}/lib/hive-exec-*.jar | tr '\n' ':'`
+HIVE_SERVICE=`ls ${HIVE_HOME}/lib/hive-service-*.jar | grep -v rpc | tr '\n' ':'`
+HIVE_METASTORE=`ls ${HIVE_HOME}/lib/hive-metastore-*.jar | tr '\n' ':'`
 # Hive 1.x/CDH has standalone jdbc jar which is no longer available in 2.x
-HIVE_JDBC=`ls ${HIVE_HOME}/lib/hive-jdbc-*standalone*.jar`
+HIVE_JDBC=`ls ${HIVE_HOME}/lib/hive-jdbc-*standalone*.jar | tr '\n' ':'`
 if [ -z "${HIVE_JDBC}" ]; then
-  HIVE_JDBC=`ls ${HIVE_HOME}/lib/hive-jdbc-*.jar | grep -v handler`
+  HIVE_JDBC=`ls ${HIVE_HOME}/lib/hive-jdbc-*.jar | grep -v handler | tr '\n' ':'`
 fi
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_SERVICE:$HIVE_JDBC
 



[GitHub] [incubator-hudi] bvaradar commented on issue #581: ClassNotFoundException:HoodieInputFormat

2019-03-21 Thread GitBox
bvaradar commented on issue #581: ClassNotFoundException:HoodieInputFormat
URL: https://github.com/apache/incubator-hudi/issues/581#issuecomment-475294724
 
 
   @daikon12 : 
   
   The run_hive_sync tool failed because there were multiple standalone jdbc 
jars found in your hive installation. We have not seen this case in our 
installations. I have created a PR 
(https://github.com/apache/incubator-hudi/pull/609) to handle this case. Can 
you please apply this patch, try this out and let us know if this solved the 
issue.
   
   Thanks,
   Balaji.V

