Build failed in Jenkins: hudi-snapshot-deployment-0.5 #51

2019-09-27 Thread Apache Jenkins Server
See 


--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on H40 (ubuntu xenial) in workspace 

No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone
Cloning repository https://github.com/apache/incubator-hudi.git
 > git init  # timeout=10
Fetching upstream changes from https://github.com/apache/incubator-hudi.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/incubator-hudi.git +refs/heads/*:refs/remotes/origin/* --depth=1
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Error performing git command
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2051)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1761)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$400(CliGitAPIImpl.java:72)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:442)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:655)
at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to H40
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:955)
at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor1084.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy135.execute(Unknown Source)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1152)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1192)
at hudson.scm.SCM.checkout(SCM.java:504)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1810)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at hudson.Proc$LocalProc.<init>(Proc.java:281)
at hudson.Proc$LocalProc.<init>(Proc.java:218)
at hudson.Launcher$LocalLauncher.launch(Launcher.java:936)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2038)
... 14 more
ERROR: Error cloning remote repo 'origin'
Retrying after 10 seconds
No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone
Cloning repository https://github.com/apache/incubator-hudi.git
 > git init  # timeout=10
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Could not init 

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to 
delete tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329293645
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
 
 Review comment:
   go for it! :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on issue #915: [HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle

2019-09-27 Thread GitBox
umehrot2 commented on issue #915: [HUDI-268] Shade and relocate Avro dependency 
in hadoop-mr-bundle
URL: https://github.com/apache/incubator-hudi/pull/915#issuecomment-536140345
 
 
   > @vinothchandar @bvaradar updated the PR
   
   
   
   > Just few nits.. But the integration test seems to fail? ideas?
   
   @vinothchandar I noticed an issue with this approach today, and that is probably 
why the integration tests are failing.
   
   When we build with the `shade-avro` profile things would be fine, but when we 
build without this profile it can break other things. We have always added an 
`<include>` for the avro dependency as well as a `<relocation>` for avro, and are 
just trying to turn shading on or off using `scope`. However, when the scope is 
`provided`, our `<relocation>` section still rewrites all references to 
`org.apache.avro` in our code/jar to `org.apache.hudi.org.apache.avro`. At runtime 
the relocated classes are never found, because the dependency was not actually 
shaded/relocated due to the scope. This will cause runtime errors.
   
   Essentially, I am not able to find a good way to activate/deactivate the shading. 
The only way I can think of now is adding the whole `shade plugin` with all its 
contents and the `avro` dependency inside the profile. But that seems like a bad 
approach to me.
   
   Have you achieved anything like this in the past?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-285) Implement HoodieStorageWriter based on the metadata

2019-09-27 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939810#comment-16939810
 ] 

leesf commented on HUDI-285:


Yes, we can get the actual type of the file from the path parameter, and then 
reuse the HoodieFileFormat.

> Implement HoodieStorageWriter based on the metadata
> ---
>
> Key: HUDI-285
> URL: https://issues.apache.org/jira/browse/HUDI-285
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Write Client
>Reporter: leesf
>Assignee: leesf
>Priority: Major
> Fix For: 0.5.1
>
>
> Currently the _getStorageWriter_ method in HoodieStorageWriterFactory to get 
> HoodieStorageWriter is hard-coded to HoodieParquetWriter, since currently only 
> parquet is supported for HoodieStorageWriter. However, it is better to 
> implement HoodieStorageWriter based on the metadata for extensibility. And if 
> _StorageWriterType_ is empty in metadata, the default HoodieParquetWriter is 
> returned so as not to affect the current logic.
> cc [~vinoth] [~vbalaji]
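
A hypothetical sketch of the direction discussed above (the class and method names 
below are made up; only HoodieFileFormat, HoodieStorageWriter and HoodieParquetWriter 
come from the issue): pick the format from the file path or metadata and fall back to 
parquet so the current behaviour is unchanged.

{code}
// Sketch only, not the real HoodieStorageWriterFactory.
final class StorageWriterDispatchSketch {

  enum FileFormat { PARQUET, OTHER }

  // Derive the format from the path; an unknown or empty writer type in the
  // metadata keeps today's default of parquet.
  static FileFormat formatFromPath(String path) {
    if (path.endsWith(".parquet")) {
      return FileFormat.PARQUET;
    }
    return FileFormat.OTHER;
  }

  static String storageWriterFor(String path) {
    if (formatFromPath(path) == FileFormat.PARQUET) {
      return "HoodieParquetWriter";
    }
    // Additional writers would be returned here once other formats are supported.
    return "HoodieParquetWriter";
  }
}
{code}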



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
leesf commented on a change in pull request #928: [HUDI-265] Failed to delete 
tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329287508
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
 
 Review comment:
   Since the hoodie-client module relies on the hoodie-common module, I think we can 
make HoodieClientTestHarness extend HoodieCommonTestHarness, move some 
methods from HoodieClientTestHarness to HoodieCommonTestHarness, and then have the 
common test units extend HoodieCommonTestHarness. CC @vinothchandar 
@yanghua 
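
A rough sketch of the layering proposed above, with hypothetical field and method 
names; only the class names HoodieCommonTestHarness and HoodieClientTestHarness come 
from the discussion:

```java
import java.io.IOException;
import org.junit.Before;
import org.junit.Rule;
import org.junit.rules.TemporaryFolder;

// Sketch only: the hudi-common test base owns the temp folder, and the
// hudi-client harness layers on top of it.
public abstract class HoodieCommonTestHarness {

  // JUnit deletes this folder after each test, which is the cleanup behaviour
  // PR #928 is after for the tmp dirs created in unit tests.
  @Rule
  public TemporaryFolder folder = new TemporaryFolder();

  protected String basePath;

  @Before
  public void initTempPath() throws IOException {
    basePath = folder.newFolder().getAbsolutePath();
  }
}

// hudi-client tests would extend the common harness and add the client-only
// setup (Spark context, write client, ...) on top.
abstract class HoodieClientTestHarness extends HoodieCommonTestHarness {
}
```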


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
leesf commented on a change in pull request #928: [HUDI-265] Failed to delete 
tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329287508
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
 
 Review comment:
   Thanks for the comments. I will update the PR soon.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
leesf commented on a change in pull request #928: [HUDI-265] Failed to delete 
tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329287463
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/TestFSUtils.java
 ##
 @@ -129,6 +129,7 @@ public void testProcessFiles() throws Exception {
 .noneMatch(s -> s.contains(HoodieTableMetaClient.METAFOLDER_NAME)));
 // Check if only files are listed including hoodie.properties
 Assert.assertEquals("Collected=" + collected2, 5, collected2.size());
+tmpFolder.delete();
 
 Review comment:
   Won't.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
leesf commented on a change in pull request #928: [HUDI-265] Failed to delete 
tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329287433
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
 
 Review comment:
   Yes


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
leesf commented on a change in pull request #928: [HUDI-265] Failed to delete 
tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329287415
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -968,6 +968,7 @@ public void testInsertsGeneratedIntoLogFilesRollback() 
throws Exception {
   Thread.sleep(1000);
   // Rollback again to pretend the first rollback failed partially. This 
should not error our
   writeClient.rollback(newCommitTime);
+  folder.delete();
 
 Review comment:
   Won't since it is a variable in the method.
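
A hypothetical illustration of the distinction made above: a TemporaryFolder that is 
only a local variable inside a test method is not managed by JUnit, so it has to be 
deleted explicitly, unlike a @Rule field that JUnit tears down after each test.

```java
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class TempFolderCleanupSketch {

  // Registered as a rule, so JUnit creates and deletes it around every test.
  @Rule
  public TemporaryFolder managedFolder = new TemporaryFolder();

  @Test
  public void localFolderNeedsManualCleanup() throws Exception {
    TemporaryFolder localFolder = new TemporaryFolder();
    localFolder.create();            // not registered with JUnit
    // ... use localFolder.getRoot() in the test body ...
    localFolder.delete();            // must be cleaned up by hand
  }
}
```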


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-27 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939797#comment-16939797
 ] 

Xing Pan edited comment on HUDI-269 at 9/28/19 12:23 AM:
-

[~vinoth]
 I thought it was because the delta streamer is too aggressive as well, so I added the 
throttle param to control this, and it helped.
 Now the issue is:
 I do significantly reduce the request count by setting the delta streamer *throttle to 5 
seconds*.
 * When there is *nothing* coming from Kafka, the request metrics look acceptable 
both in the data source writer way and the delta streamer way. (If I don't have the 
throttle, delta streamer will send too many requests even with no input coming.)
 * But if I have Kafka input streaming at around 10 records per second, I found that 
even if I set the 5 second throttle, writing to hudi with delta streamer causes 
10 times more requests than doing it the data source writer way.
 So I would choose the data source writer anyway, since that way I always save on 
request count.

So now I am wondering what the pros and cons are for choosing between the spark 
datasource and the delta streamer.
 As far as I can see, in my scenario, if I use the delta streamer:
 * The delta streamer can help ingest data from Kafka
 * It has a self-managed checkpoint.
 * I can set the compaction job weight

and if I use the spark data source writer:
 * I have more control of my code; I can have my own Kafka ingestion 
implementation
 * I will save money :) (costs are still lower even if the delta streamer has a throttle 
control for now)

 

So currently, if I can't get the two request metrics to the same level, I'd use the 
data source writer. 
 Any more suggestions on choosing the delta streamer?


was (Author: xingxpan):
[~vinoth]
I thought it was because the delta streamer is too aggressive as well, so I added the 
throttle param to control this, and it helped.
Now the issue is:
I do significantly reduce the request count by setting the delta streamer *throttle to 5 
seconds*.
 * When there is *nothing* coming from Kafka, the request metrics look acceptable 
both in the data source writer way and the delta streamer way.
 * But if I have Kafka input streaming at around 10 records per second, I found that 
even if I set the 5 second throttle, writing to hudi with delta streamer causes 
10 times more requests than doing it the data source writer way.
So I would choose the data source writer anyway, since that way I always save on 
request count.

So now I am wondering what the pros and cons are for choosing between the spark 
datasource and the delta streamer.
As far as I can see, in my scenario, if I use the delta streamer:
 * The delta streamer can help ingest data from Kafka
 * It has a self-managed checkpoint.
 * I can set the compaction job weight

and if I use the spark data source writer:
 * I have more control of my code; I can have my own Kafka ingestion 
implementation
 * I will save money :) (costs are still lower even if the delta streamer has a throttle 
control for now)

 

So currently, if I can't get the two request metrics to the same level, I'd use the 
data source writer. 
Any more suggestions on choosing the delta streamer?

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a 
> bit.
> so it's nice to have a parameter to control the min sync interval of each 
> sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 
> get/put/list requests. we don't want to pay for too many requests for a 
> really slow change table.
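
A minimal sketch of the throttle described above, with hypothetical names; the real 
DeltaStreamer config key and loop differ, but the idea is the same: in continuous 
mode, sleep away whatever is left of the configured minimum interval before the next 
sync.

{code}
// Sketch only: minSyncIntervalMs defaults to 0, so the existing
// sync-as-fast-as-possible behaviour is unchanged.
final class SyncThrottleSketch {

  static void runContinuously(long minSyncIntervalMs, Runnable oneSync) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      long start = System.currentTimeMillis();
      oneSync.run();
      long remaining = minSyncIntervalMs - (System.currentTimeMillis() - start);
      if (remaining > 0) {
        Thread.sleep(remaining); // fewer sync rounds => fewer S3 list/get/put requests
      }
    }
  }
}
{code}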



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-27 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939797#comment-16939797
 ] 

Xing Pan commented on HUDI-269:
---

[~vinoth]
I thought it was because the delta streamer is too aggressive as well, so I added the 
throttle param to control this, and it helped.
Now the issue is:
I do significantly reduce the request count by setting the delta streamer *throttle to 5 
seconds*.
 * When there is *nothing* coming from Kafka, the request metrics look acceptable 
both in the data source writer way and the delta streamer way.
 * But if I have Kafka input streaming at around 10 records per second, I found that 
even if I set the 5 second throttle, writing to hudi with delta streamer causes 
10 times more requests than doing it the data source writer way.
So I would choose the data source writer anyway, since that way I always save on 
request count.

So now I am wondering what the pros and cons are for choosing between the spark 
datasource and the delta streamer.
As far as I can see, in my scenario, if I use the delta streamer:
 * The delta streamer can help ingest data from Kafka
 * It has a self-managed checkpoint.
 * I can set the compaction job weight

and if I use the spark data source writer:
 * I have more control of my code; I can have my own Kafka ingestion 
implementation
 * I will save money :) (costs are still lower even if the delta streamer has a throttle 
control for now)

 

So currently, if I can't get the two request metrics to the same level, I'd use the 
data source writer. 
Any more suggestions on choosing the delta streamer?

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a 
> bit.
> so it's nice to have a parameter to control the min sync interval of each 
> sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 
> get/put/list requests. we don't want to pay for too many requests for a 
> really slow change table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-278) Translate Administering page

2019-09-27 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-278.
--
Resolution: Fixed

Fixed via asf-site: 7ff0fe2a0754771fdf27d8280212a452f4e9a269

> Translate Administering page
> 
>
> Key: HUDI-278
> URL: https://issues.apache.org/jira/browse/HUDI-278
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: docs-chinese
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The online HTML web page: [http://hudi.apache.org/admin_guide.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329190769
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
 
 Review comment:
   2c: could be a lot simpler if we just checked one by one and then, at the end, 
threw the error based on whether a password was obtainable
   
   ```
   String password = null;
   if (properties.containsKey(Config.PASSWORD_FILE)) {
     // set password here if it can be read from the password file
   }
   if (properties.containsKey(Config.PASSWORD)) {
     password = properties.getString(Config.PASSWORD);
   }
   if (password == null) {
     // throw the error
   }
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329192914
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
+if (properties.containsKey(Config.PASSWORD_FILE)) {
+  if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info("Reading password for password file {}", 
properties.getString(Config.PASSWORD_FILE));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(
+String.format("%s property cannot be null or empty", 
Config.PASSWORD_FILE));
+  }
+} else {
+  throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
+  + "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+}
+  } else if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else {
+throw new IllegalArgumentException(String.format("%s cannot be null or 
empty. ", Config.PASSWORD));
+  }
+  if (properties.containsKey(Config.EXTRA_OPTIONS)) {
+if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.EXTRA_OPTIONS))) {
+  LOG.info("Setting {}", Config.EXTRA_OPTIONS);
+  String[] options = 
properties.getString(Config.EXTRA_OPTIONS).split(",");
+  for (String option : options) {
+if (!StringUtils.isNullOrEmpty(option)) {
+  String[] kv = option.split("=");
+  dataFrameReader = dataFrameReader.option(kv[0], kv[1]);
+  LOG.info("{} = {} has been set for JDBC pull ", kv[0], kv[1]);
+}
+  }
+}
+  }
+  if (properties.getBoolean(Config.IS_INCREMENTAL)) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329192724
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
+if (properties.containsKey(Config.PASSWORD_FILE)) {
+  if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info("Reading password for password file {}", 
properties.getString(Config.PASSWORD_FILE));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(
+String.format("%s property cannot be null or empty", 
Config.PASSWORD_FILE));
+  }
+} else {
+  throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
+  + "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+}
+  } else if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else {
+throw new IllegalArgumentException(String.format("%s cannot be null or 
empty. ", Config.PASSWORD));
+  }
+  if (properties.containsKey(Config.EXTRA_OPTIONS)) {
+if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.EXTRA_OPTIONS))) {
+  LOG.info("Setting {}", Config.EXTRA_OPTIONS);
+  String[] options = 
properties.getString(Config.EXTRA_OPTIONS).split(",");
+  for (String option : options) {
+if (!StringUtils.isNullOrEmpty(option)) {
+  String[] kv = option.split("=");
+  dataFrameReader = dataFrameReader.option(kv[0], kv[1]);
+  LOG.info("{} = {} has been set for JDBC pull ", kv[0], kv[1]);
+}
+  }
+}
+  }
+  if (properties.getBoolean(Config.IS_INCREMENTAL)) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329194622
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
+if (properties.containsKey(Config.PASSWORD_FILE)) {
+  if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info("Reading password for password file {}", 
properties.getString(Config.PASSWORD_FILE));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(
+String.format("%s property cannot be null or empty", 
Config.PASSWORD_FILE));
+  }
+} else {
+  throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
+  + "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+}
+  } else if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else {
+throw new IllegalArgumentException(String.format("%s cannot be null or 
empty. ", Config.PASSWORD));
+  }
+  if (properties.containsKey(Config.EXTRA_OPTIONS)) {
+if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.EXTRA_OPTIONS))) {
+  LOG.info("Setting {}", Config.EXTRA_OPTIONS);
+  String[] options = 
properties.getString(Config.EXTRA_OPTIONS).split(",");
+  for (String option : options) {
+if (!StringUtils.isNullOrEmpty(option)) {
+  String[] kv = option.split("=");
+  dataFrameReader = dataFrameReader.option(kv[0], kv[1]);
+  LOG.info("{} = {} has been set for JDBC pull ", kv[0], kv[1]);
+}
+  }
+}
+  }
+  if (properties.getBoolean(Config.IS_INCREMENTAL)) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329118884
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
 
 Review comment:
   nit: static? 
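
A minimal sketch of what the nit suggests, assuming the template is otherwise 
unchanged: since the query string never varies per instance, it can be a shared 
compile-time constant.

```java
// Hypothetical rewrite of the field the nit refers to.
private static final String PPD_QUERY = "(select * from %s where %s >= \" %s \") rdbms_table";
```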


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329191253
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
+if (properties.containsKey(Config.PASSWORD_FILE)) {
+  if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info("Reading password for password file {}", 
properties.getString(Config.PASSWORD_FILE));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(
+String.format("%s property cannot be null or empty", 
Config.PASSWORD_FILE));
+  }
+} else {
+  throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
+  + "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+}
+  } else if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else {
+throw new IllegalArgumentException(String.format("%s cannot be null or 
empty. ", Config.PASSWORD));
+  }
+  if (properties.containsKey(Config.EXTRA_OPTIONS)) {
 
 Review comment:
   instead of splitting/encoding using `,` like this.. can you just take all 
properties that begin with the prefix `hoodie.datasource.jdbc`?
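
A sketch of the prefix-based alternative, under assumptions: the prefix string and 
the helper below are hypothetical, not the PR's actual config keys.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.spark.sql.DataFrameReader;

public final class JdbcExtraOptionsSketch {

  private static final String EXTRA_OPTIONS_PREFIX = "hoodie.datasource.jdbc.extra.options.";

  // Copy every property whose key starts with the prefix onto the JDBC
  // DataFrameReader, so arbitrary driver options can be passed without a
  // comma/equals-encoded list.
  public static DataFrameReader applyExtraOptions(DataFrameReader reader, Properties props) {
    for (Map.Entry<Object, Object> entry : props.entrySet()) {
      String key = entry.getKey().toString();
      if (key.startsWith(EXTRA_OPTIONS_PREFIX)) {
        reader = reader.option(key.substring(EXTRA_OPTIONS_PREFIX.length()), entry.getValue().toString());
      }
    }
    return reader;
  }
}
```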


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329193472
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
+if (properties.containsKey(Config.PASSWORD_FILE)) {
+  if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info("Reading password for password file {}", 
properties.getString(Config.PASSWORD_FILE));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(
+String.format("%s property cannot be null or empty", 
Config.PASSWORD_FILE));
+  }
+} else {
+  throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
+  + "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+}
+  } else if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else {
+throw new IllegalArgumentException(String.format("%s cannot be null or 
empty. ", Config.PASSWORD));
+  }
+  if (properties.containsKey(Config.EXTRA_OPTIONS)) {
+if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.EXTRA_OPTIONS))) {
+  LOG.info("Setting {}", Config.EXTRA_OPTIONS);
+  String[] options = 
properties.getString(Config.EXTRA_OPTIONS).split(",");
+  for (String option : options) {
+if (!StringUtils.isNullOrEmpty(option)) {
+  String[] kv = option.split("=");
+  dataFrameReader = dataFrameReader.option(kv[0], kv[1]);
+  LOG.info("{} = {} has been set for JDBC pull ", kv[0], kv[1]);
+}
+  }
+}
+  }
+  if (properties.getBoolean(Config.IS_INCREMENTAL)) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #917: [HUDI-251] JDBC 
incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#discussion_r329195751
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,233 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LoggerFactory.getLogger(JDBCSource.class);
+
+  private final String ppdQuery = "(select * from %s where %s >= \" %s \") 
rdbms_table";
+
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+FSDataInputStream passwordFileStream = null;
+try {
+  DataFrameReader dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (!properties.containsKey(Config.PASSWORD)) {
+if (properties.containsKey(Config.PASSWORD_FILE)) {
+  if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info("Reading password for password file {}", 
properties.getString(Config.PASSWORD_FILE));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(
+String.format("%s property cannot be null or empty", 
Config.PASSWORD_FILE));
+  }
+} else {
+  throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
+  + "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+}
+  } else if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else {
+throw new IllegalArgumentException(String.format("%s cannot be null or 
empty. ", Config.PASSWORD));
+  }
+  if (properties.containsKey(Config.EXTRA_OPTIONS)) {
+if 
(!StringUtils.isNullOrEmpty(properties.getString(Config.EXTRA_OPTIONS))) {
+  LOG.info("Setting {}", Config.EXTRA_OPTIONS);
+  String[] options = 
properties.getString(Config.EXTRA_OPTIONS).split(",");
+  for (String option : options) {
+if (!StringUtils.isNullOrEmpty(option)) {
+  String[] kv = option.split("=");
+  dataFrameReader = dataFrameReader.option(kv[0], kv[1]);
+  LOG.info("{} = {} has been set for JDBC pull ", kv[0], kv[1]);
+}
+  }
+}
+  }
+  if (properties.getBoolean(Config.IS_INCREMENTAL)) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  

[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-27 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939647#comment-16939647
 ] 

Vinoth Chandar commented on HUDI-269:
-

When you are writing your own spark app, specifying hoodie options using --conf 
may not be taking effect? The ones below?

{code}
--conf hoodie.embed.timeline.server=true \
--conf hoodie.filesystem.view.incr.timeline.sync.enable=true \
{code} 

I am wondering if these requests are just due to delta streamer listing the 
partitions way too aggressively, over and over (unlike the spark streaming 
code, which has a sane 5s processing window). Was deltastreamer generating many 
commits/second without the throttle? 



> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a bit,
> so it's nice to have a parameter to control the minimum sync interval of each
> sync in continuous mode.
> This param defaults to 0, so it should not affect the current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges per S3
> get/put/list request; we don't want to pay for too many requests for a
> really slowly changing table.
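
A minimal sketch of the throttling idea described above; the class, field, and method names (minSyncIntervalSeconds, syncOnce) are made up for illustration and are not the actual DeltaStreamer code:

{code}
public class ThrottledSyncLoop {

  private final long minSyncIntervalSeconds;

  public ThrottledSyncLoop(long minSyncIntervalSeconds) {
    // defaults to 0 in the proposal, i.e. no throttling and no change to current behavior
    this.minSyncIntervalSeconds = minSyncIntervalSeconds;
  }

  public void runContinuously() throws InterruptedException {
    while (true) {
      long start = System.currentTimeMillis();
      syncOnce(); // placeholder for one ingest-and-commit round
      long waitMs = minSyncIntervalSeconds * 1000 - (System.currentTimeMillis() - start);
      if (waitMs > 0) {
        Thread.sleep(waitMs); // pad short rounds so S3 get/put/list requests stay bounded
      }
    }
  }

  private void syncOnce() {
    // placeholder
  }
}
{code}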



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-27 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939642#comment-16939642
 ] 

Vinoth Chandar commented on HUDI-269:
-

[~uditme] as fyi, in case you can quickly spot anything.. :) 

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a bit,
> so it's nice to have a parameter to control the minimum sync interval of each
> sync in continuous mode.
> This param defaults to 0, so it should not affect the current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges per S3
> get/put/list request; we don't want to pay for too many requests for a
> really slowly changing table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-285) Implement HoodieStorageWriter based on the metadata

2019-09-27 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939638#comment-16939638
 ] 

Vinoth Chandar commented on HUDI-285:
-

In general, I am in support of actually instantiating the writer based on the actual file type.. there is a HoodieFileFormat class already; maybe we can just reuse that?
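
A hedged sketch of that direction, with simplified stand-in types rather than the actual Hudi writer signatures (only the HoodieFileFormat-style dispatch is the point here):

{code}
enum FileFormat { PARQUET, ORC }

interface StorageWriter {
  void write(String record);
}

final class StorageWriterFactory {
  // Pick the writer implementation from the format recorded in metadata instead of hard-coding Parquet.
  static StorageWriter create(FileFormat format) {
    switch (format) {
      case PARQUET:
        return record -> System.out.println("parquet <- " + record); // stand-in for HoodieParquetWriter
      default:
        throw new UnsupportedOperationException("No HoodieStorageWriter for " + format);
    }
  }
}
{code}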

> Implement HoodieStorageWriter based on the metadata
> ---
>
> Key: HUDI-285
> URL: https://issues.apache.org/jira/browse/HUDI-285
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Write Client
>Reporter: leesf
>Assignee: leesf
>Priority: Major
> Fix For: 0.5.1
>
>
> Currently the _getStorageWriter_ method in HoodieStorageWriterFactory, which returns the
> HoodieStorageWriter, is hard-coded to HoodieParquetWriter since only parquet is
> currently supported for HoodieStorageWriter. However, it is better to
> instantiate the HoodieStorageWriter based on the metadata, for extensibility. And if
> _StorageWriterType_ is empty in the metadata, the default HoodieParquetWriter is
> returned so as not to affect the current logic.
> cc [~vinoth] [~vbalaji]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar merged pull request #926: [HUDI-278] Translate Administering page

2019-09-27 Thread GitBox
vinothchandar merged pull request #926: [HUDI-278] Translate Administering page
URL: https://github.com/apache/incubator-hudi/pull/926
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-278] Translate Administering page (#926)

2019-09-27 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 7ff0fe2  [HUDI-278] Translate Administering page (#926)
7ff0fe2 is described below

commit 7ff0fe2a0754771fdf27d8280212a452f4e9a269
Author: leesf <490081...@qq.com>
AuthorDate: Sat Sep 28 01:48:35 2019 +0800

[HUDI-278] Translate Administering page (#926)

* [HUDI-278] Translate Administering page

* [hotfix] address comments
---
 docs/admin_guide.cn.md | 177 +++--
 1 file changed, 83 insertions(+), 94 deletions(-)

diff --git a/docs/admin_guide.cn.md b/docs/admin_guide.cn.md
index 96ff639..2ba04a5 100644
--- a/docs/admin_guide.cn.md
+++ b/docs/admin_guide.cn.md
@@ -4,24 +4,24 @@ keywords: hudi, administration, operation, devops
 sidebar: mydoc_sidebar
 permalink: admin_guide.html
 toc: false
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+summary: 本节概述了可用于操作Hudi数据集生态系统的工具
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+管理员/运维人员可以通过以下方式了解Hudi数据集/管道
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [通过Admin CLI进行管理](#admin-cli)
+ - [Graphite指标](#metrics)
+ - [Hudi应用程序的Spark UI](#spark-ui)
 
-This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
+本节简要介绍了每一种方法,并提供了有关[故障排除](#troubleshooting)的一些常规指南
 
 ## Admin CLI {#admin-cli}
 
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
-A hudi dataset resides on DFS, in a location referred to as the **basePath** 
and we would need this location in order to connect to a Hudi dataset.
-Hudi library effectively manages this dataset internally, using .hoodie 
subfolder to track all metadata
+一旦构建了hudi,就可以通过`cd hudi-cli && ./hudi-cli.sh`启动shell。
+一个hudi数据集位于DFS上的**basePath**位置,我们需要该位置才能连接到Hudi数据集。
+Hudi库使用.hoodie子文件夹跟踪所有元数据,从而有效地在内部管理该数据集。
 
-To initialize a hudi table, use the following command.
+初始化hudi表,可使用如下命令。
 
 ```
 18/09/06 15:56:52 INFO annotation.AutowiredAnnotationBeanPostProcessor: 
JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
@@ -58,7 +58,7 @@ hoodie:hoodie_table_1->desc
 | hoodie.archivelog.folder|  |
 ```
 
-Following is a sample command to connect to a Hudi dataset contains uber trips.
+以下是连接到包含uber trips的Hudi数据集的示例命令。
 
 ```
 hoodie:trips->connect --path /app/uber/trips
@@ -70,8 +70,7 @@ Metadata for table trips loaded
 hoodie:trips->
 ```
 
-Once connected to the dataset, a lot of other commands become available. The 
shell has contextual autocomplete help (press TAB) and below is a list of all 
commands, few of which are reviewed in this section
-are reviewed
+连接到数据集后,便可使用许多其他命令。该shell程序具有上下文自动完成帮助(按TAB键),下面是所有命令的列表,本节中对其中的一些命令进行了详细示例。
 
 
 ```
@@ -107,12 +106,12 @@ hoodie:trips->
 ```
 
 
- Inspecting Commits
+ 检查提交
 
-The task of upserting or inserting a batch of incoming records is known as a 
**commit** in Hudi. A commit provides basic atomicity guarantees such that only 
commited data is available for querying.
-Each commit has a monotonically increasing string/number called the **commit 
number**. Typically, this is the time at which we started the commit.
+在Hudi中,更新或插入一批记录的任务被称为**提交**。提交可提供基本的原子性保证,即只有提交的数据可用于查询。
+每个提交都有一个单调递增的字符串/数字,称为**提交编号**。通常,这是我们开始提交的时间。
 
-To view some basic information about the last 10 commits,
+查看有关最近10次提交的一些基本信息,
 
 
 ```
@@ -126,8 +125,7 @@ hoodie:trips->commits show --sortBy "Total Bytes Written" 
--desc true --limit 10
 hoodie:trips->
 ```
 
-At the start of each write, Hudi also writes a .inflight commit to the .hoodie 
folder. You can use the timestamp there to estimate how long the commit has 
been inflight
-
+在每次写入开始时,Hudi还将.inflight提交写入.hoodie文件夹。您可以使用那里的时间戳来估计正在进行的提交已经花费的时间
 
 ```
 $ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
@@ -135,9 +133,9 @@ $ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
 ```
 
 
- Drilling Down to a specific Commit
+ 深入到特定的提交
 
-To understand how the writes spread across specific partiions,
+了解写入如何分散到特定分区,
 
 
 ```
@@ -149,8 +147,7 @@ hoodie:trips->commit showpartitions --commit 20161005165855 
--sortBy "Total Byte
  
 ```
 
-If you need file level granularity , we can do the following
-
+如果您需要文件级粒度,我们可以执行以下操作
 
 ```
 hoodie:trips->commit showfiles --commit 20161005165855 --sortBy "Partition 
Path"
@@ -162,10 +159,9 @@ hoodie:trips->commit showfiles --commit 20161005165855 
--sortBy "Partition Path"
 ```
 
 
- FileSystem View
+ 文件系统视图
 
-Hudi views each partition as a collection of file-groups with each file-group 
containing a list of file-slices in commit
-order (See 

[GitHub] [incubator-hudi] vinothchandar commented on issue #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-27 Thread GitBox
vinothchandar commented on issue #923: HUDI-247 Unify the initialization of 
HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#issuecomment-535981015
 
 
   >IMHO, we can create a new object. However, I still think defining a static method like reloadMetaClient to create a new object is better than just creating a new object directly in a test class or other context.
   I like this idea. Would we pass in the old metaclient and get a refreshed one? More like a clone/copy method.. `newClient = HoodieTableMetaClient.reload(oldClient)`.
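
   A hedged sketch of that clone/copy idea; it assumes the existing `HoodieTableMetaClient(Configuration, String, boolean)` constructor and the `getHadoopConf()`/`getBasePath()` accessors, and is not a committed Hudi API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;

public class MetaClientReloadSketch {
  // Build a fresh client so table config and timeline are re-read from storage,
  // instead of mutating (or closing resources of) the old instance.
  public static HoodieTableMetaClient reload(HoodieTableMetaClient oldClient) {
    Configuration conf = oldClient.getHadoopConf(); // assumed accessor
    return new HoodieTableMetaClient(conf, oldClient.getBasePath(), true);
  }
}
```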


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-27 Thread GitBox
vinothchandar edited a comment on issue #923: HUDI-247 Unify the initialization 
of HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#issuecomment-535981015
 
 
   >IMHO, we can create a new object. However, I still think defining a static method like reloadMetaClient to create a new object is better than just creating a new object directly in a test class or other context.
   
   I like this idea. Would we pass in the old metaclient and get a refreshed one? More like a clone/copy method.. `newClient = HoodieTableMetaClient.reload(oldClient)`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #915: [HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #915: [HUDI-268] Shade and 
relocate Avro dependency in hadoop-mr-bundle
URL: https://github.com/apache/incubator-hudi/pull/915#discussion_r329118485
 
 

 ##
 File path: packaging/hudi-hadoop-mr-bundle/pom.xml
 ##
 @@ -127,5 +133,20 @@
   parquet-avro
   compile
 
+
+
+  org.apache.avro
+  avro
+  ${avro.scope}
+
   
+
+  
+
+  shade-avro
 
 Review comment:
   Same here.. have the bundle name in the profile id? And probably move this to the root pom as, maybe, `aws-emr-profile`; that way you can control other overrides as well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #915: [HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #915: [HUDI-268] Shade and 
relocate Avro dependency in hadoop-mr-bundle
URL: https://github.com/apache/incubator-hudi/pull/915#discussion_r329117901
 
 

 ##
 File path: packaging/hudi-hadoop-mr-bundle/pom.xml
 ##
 @@ -30,6 +30,7 @@
 true
 ${project.basedir}/src/main/resources/META-INF
 HUDI_NOTICE.txt
+provided
 
 Review comment:
   nit: rename to `mr.bundle.avro.scope`? In case we decide on this strategy for other bundles too.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-27 Thread GitBox
yanghua commented on issue #923: HUDI-247 Unify the initialization of 
HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#issuecomment-535963964
 
 
   If we do not need to release the FileSystem object, the value of a method for reloading the metadata client is reduced.
   IMHO, we can create a new object. However, I still think defining a static method like `reloadMetaClient` to create a new object is better than just creating a new object directly in a test class or other context, because the bare code (`new HoodieTableMetaClient(...)`) does not convey the **intention** and **necessity**. But I respect your decision. Closing this PR now.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua closed pull request #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-27 Thread GitBox
yanghua closed pull request #923: HUDI-247 Unify the initialization of 
HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-27 Thread GitBox
yanghua commented on a change in pull request #923: HUDI-247 Unify the 
initialization of HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#discussion_r329094509
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -477,4 +478,43 @@ public void setActiveTimeline(HoodieActiveTimeline activeTimeline) {
   public void setTableConfig(HoodieTableConfig tableConfig) {
     this.tableConfig = tableConfig;
   }
+
+  //--
+  // Resource initialization and cleanup
+  //--
+
+  private void init(Configuration conf, String basePath,
+      boolean loadActiveTimelineOnLoad, ConsistencyGuardConfig consistencyGuardConfig) {
+    log.info("Loading HoodieTableMetaClient from " + basePath);
+    this.basePath = basePath;
+    this.consistencyGuardConfig = consistencyGuardConfig;
+    this.hadoopConf = new SerializableConfiguration(conf);
+    Path basePathDir = new Path(this.basePath);
+    this.metaPath = new Path(basePath, METAFOLDER_NAME).toString();
+    Path metaPathDir = new Path(this.metaPath);
+    this.fs = getFs();
+    DatasetNotFoundException.checkValidDataset(fs, basePathDir, metaPathDir);
+    this.tableConfig = new HoodieTableConfig(fs, metaPath);
+    this.tableType = tableConfig.getTableType();
+    log.info("Finished Loading Table of type " + tableType + " from " + basePath);
+    if (loadActiveTimelineOnLoad) {
+      log.info("Loading Active commit timeline for " + basePath);
+      getActiveTimeline();
+    }
+  }
+
+  private void cleanup() throws IOException {
+    this.basePath = null;
+    this.consistencyGuardConfig = null;
+    this.hadoopConf = null;
+    this.metaPath = null;
+    if (this.fs != null) {
 
 Review comment:
   I just looked at the source code of Hadoop `FileSystem`. You are right: `FileSystem.get` will return an instance from the cache, while `FileSystem.newInstance` returns a unique FileSystem object. Sorry, I did not pay attention to that.
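
   A hedged illustration of that difference with plain Hadoop APIs: `FileSystem.get()` may hand back a JVM-wide cached instance, while `FileSystem.newInstance()` always creates a fresh one, so only the latter is safe to close without affecting other callers.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem cached = FileSystem.get(conf);           // cached; closing it can break other users of the cache
    FileSystem isolated = FileSystem.newInstance(conf); // unique instance; safe to close
    System.out.println(cached == FileSystem.get(conf)); // typically true: same cached object
    isolated.close();
  }
}
```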


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
yanghua commented on a change in pull request #928: [HUDI-265] Failed to delete 
tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329088776
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
 
 Review comment:
   +1 for the solution about unifying resource management.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to 
delete tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329069160
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -968,6 +968,7 @@ public void testInsertsGeneratedIntoLogFilesRollback() 
throws Exception {
   Thread.sleep(1000);
   // Rollback again to pretend the first rollback failed partially. This 
should not error our
   writeClient.rollback(newCommitTime);
+  folder.delete();
 
 Review comment:
   doesn't this happen already at the HoodieClientTestHarness level?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to 
delete tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329071805
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
 
 Review comment:
   +1 on using the `@Rule` wondering if that will give us auto deletion.. I 
think it's deleted in afterClass of the test. 
https://stackoverflow.com/questions/16494459/why-isnt-junit-temporaryfolder-deleted
 ? 
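
   For reference, a minimal sketch of the `@Rule` usage being discussed (standard JUnit 4, nothing Hudi-specific): the rule recreates the folder before each test and deletes it afterwards, so no manual `folder.delete()` should be needed.

```java
import java.io.File;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class TemporaryFolderExample {

  @Rule
  public TemporaryFolder folder = new TemporaryFolder();

  @Test
  public void createsAndCleansUpTempDir() throws Exception {
    File dir = folder.newFolder("hoodie-tmp");
    // use dir ...; JUnit removes it automatically after the test method finishes
  }
}
```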


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to 
delete tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329070467
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/log/HoodieLogFormatTest.java
 ##
 @@ -78,6 +80,8 @@
 @RunWith(Parameterized.class)
 public class HoodieLogFormatTest {
 
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
 
 Review comment:
   I am wondering if we should reuse @yanghua's work here. Maybe create a thin HoodieCommonTestHarness class and make it extend the HoodieClientTestHarness, and make these tests subclass HoodieCommonTestHarness?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #928: [HUDI-265] Failed to 
delete tmp dirs created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928#discussion_r329072018
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/TestFSUtils.java
 ##
 @@ -129,6 +129,7 @@ public void testProcessFiles() throws Exception {
 .noneMatch(s -> s.contains(HoodieTableMetaClient.METAFOLDER_NAME)));
 // Check if only files are listed including hoodie.properties
 Assert.assertEquals("Collected=" + collected2, 5, collected2.size());
+tmpFolder.delete();
 
 Review comment:
   won't this delete it for other tests as well?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-27 Thread GitBox
vinothchandar commented on a change in pull request #923: HUDI-247 Unify the 
initialization of HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#discussion_r329066558
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -477,4 +478,43 @@ public void setActiveTimeline(HoodieActiveTimeline activeTimeline) {
   public void setTableConfig(HoodieTableConfig tableConfig) {
     this.tableConfig = tableConfig;
   }
+
+  //--
+  // Resource initialization and cleanup
+  //--
+
+  private void init(Configuration conf, String basePath,
+      boolean loadActiveTimelineOnLoad, ConsistencyGuardConfig consistencyGuardConfig) {
+    log.info("Loading HoodieTableMetaClient from " + basePath);
+    this.basePath = basePath;
+    this.consistencyGuardConfig = consistencyGuardConfig;
+    this.hadoopConf = new SerializableConfiguration(conf);
+    Path basePathDir = new Path(this.basePath);
+    this.metaPath = new Path(basePath, METAFOLDER_NAME).toString();
+    Path metaPathDir = new Path(this.metaPath);
+    this.fs = getFs();
+    DatasetNotFoundException.checkValidDataset(fs, basePathDir, metaPathDir);
+    this.tableConfig = new HoodieTableConfig(fs, metaPath);
+    this.tableType = tableConfig.getTableType();
+    log.info("Finished Loading Table of type " + tableType + " from " + basePath);
+    if (loadActiveTimelineOnLoad) {
+      log.info("Loading Active commit timeline for " + basePath);
+      getActiveTimeline();
+    }
+  }
+
+  private void cleanup() throws IOException {
+    this.basePath = null;
+    this.consistencyGuardConfig = null;
+    this.hadoopConf = null;
+    this.metaPath = null;
+    if (this.fs != null) {
 
 Review comment:
   ah.. let's not close the fs object... This instance will be cached at the FileSystem level, and closing it here would make other threads fail (e.g. with num.cores > 1, another Spark executor's thread on the same JVM). We have worked through this before in #620..
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #926: [HUDI-278] Translate Administering page

2019-09-27 Thread GitBox
leesf commented on issue #926: [HUDI-278] Translate Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#issuecomment-535839642
 
 
   @yihua Thanks again for your careful review. Updated the PR to address your 
comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services