[ 
https://issues.apache.org/jira/browse/BEAM-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986339#comment-16986339
 ] 

Tomo Suzuki edited comment on BEAM-8822 at 12/2/19 11:05 PM:
-------------------------------------------------------------

There are two modules that use Hadoop client dependencies: hadoop-format and 
hadoop-file-system.


Note that iemejia mentioned that "The hadoop modules (both hadoop-file-system 
and hadoop-format) are provided by the user (commonly by relying on the ones 
available in their clusters)". This means the Hadoop dependencies 
(hadoop_client, hadoop_common, and hadoop_mapreduce_client_core) are marked as 
{{provided}}:

{noformat}
  provided library.java.hadoop_client
  provided library.java.hadoop_common
  provided library.java.hadoop_mapreduce_client_core
{noformat}
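Because these dependencies are {{provided}}, a user's pipeline has to supply the Hadoop client artifacts itself at runtime. A minimal sketch of what that looks like in a user's Gradle build (the Beam version and the 2.8.5 Hadoop version here are illustrative; in practice the Hadoop version should match the cluster's):

```groovy
dependencies {
    implementation "org.apache.beam:beam-sdks-java-io-hadoop-file-system:2.16.0"
    // Beam declares the Hadoop client artifacts as 'provided', so the user
    // must add them explicitly, matching the cluster's Hadoop version.
    runtimeOnly "org.apache.hadoop:hadoop-client:2.8.5"
}
```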


h1. sdks/java/io/hadoop-format

As per [Hadoop Input/Output Format 
IO|https://beam.apache.org/documentation/io/built-in/hadoop/], HadoopFormatIO 
(in the beam-sdks-java-io-hadoop-format artifact) does not just read files from 
Hadoop; it serves as the fundamental class for connecting to other data stores 
such as Cassandra, HBase, and even Elasticsearch.
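For reference, reading through HadoopFormatIO follows the pattern below. This is a sketch based on the linked documentation page, not code from the Beam repository; the SequenceFileInputFormat, key/value classes, and input path are illustrative:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class HadoopFormatIOSketch {
  public static void main(String[] args) {
    // The Hadoop Configuration tells HadoopFormatIO which InputFormat to
    // instantiate and which key/value types it produces.
    Configuration conf = new Configuration();
    conf.set("mapreduce.job.inputformat.class",
        "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat");
    conf.set("key.class", Text.class.getName());
    conf.set("value.class", LongWritable.class.getName());
    // Illustrative input location; a real pipeline points at actual data.
    conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs://namenode/input");

    Pipeline p = Pipeline.create();
    PCollection<KV<Text, LongWritable>> records =
        p.apply(HadoopFormatIO.<Text, LongWritable>read().withConfiguration(conf));
    p.run().waitUntilFinish();
  }
}
```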

Its integration test HadoopFormatIOIT uses PostgreSQL. Setting up a PostgreSQL 
instance on a local MacBook and running HadoopFormatIOIT from IntelliJ worked:
{noformat}
--tests
org.apache.beam.sdk.io.hadoop.format.HadoopFormatIOIT
-DintegrationTestPipelineOptions='[
"--postgresServerName=localhost",
"--postgresUsername=suztomo",
"--postgresDatabaseName=suztomo",
"--postgresPassword=",
"--postgresSsl=false",
"--numberOfRecords=1000"
]'{noformat}
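For the record, those arguments are passed to the Gradle integration-test task; the invocation from the command line is presumably along these lines (the task path is my assumption, not taken from the build output above):

```shell
./gradlew :sdks:java:io:hadoop-format:integrationTest \
    --tests org.apache.beam.sdk.io.hadoop.format.HadoopFormatIOIT \
    -DintegrationTestPipelineOptions='["--postgresServerName=localhost",
        "--postgresUsername=suztomo", "--postgresDatabaseName=suztomo",
        "--postgresPassword=", "--postgresSsl=false", "--numberOfRecords=1000"]'
```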


h1. sdks/java/io/hadoop-file-system

HadoopFileSystem is in the sdks/java/io/hadoop-file-system module. Its test 
HadoopFileSystemTest creates a MiniDFSCluster (from the hadoop-hdfs artifact) 
and verifies interaction with it by creating and reading files. Beam's 
HadoopFileSystem class provides functions such as {{match}}, {{create}}, 
{{open}}, and {{copy}}.
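The shape of that test setup can be sketched as follows, assuming the hadoop-hdfs test artifact is on the classpath. This is an illustrative outline of the create-then-read pattern, not the test's exact code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MiniDfsSketch {
  public static void main(String[] args) throws Exception {
    // Spin up an in-process HDFS cluster backed by a local directory.
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path path = new Path("/tmp/test.txt");
      // Create a file through the client, then check it exists on the
      // server side, exercising client/server communication.
      try (FSDataOutputStream out = fs.create(path)) {
        out.writeUTF("hello");
      }
      if (!fs.exists(path)) {
        throw new AssertionError("file not visible on MiniDFSCluster");
      }
    } finally {
      cluster.shutdown();
    }
  }
}
```

A compatibility check along the lines described below would pin the MiniDFSCluster and the client FileSystem to different Hadoop versions.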

My initial thought on testing compatibility of the Hadoop dependency upgrade is 
to check such communication between a new HDFS server and an old HDFS client.

But where is HadoopFileSystem used?

As per [A review of input streaming 
connectors|https://beam.apache.org/blog/2018/08/20/review-input-streaming-connectors.html],
 HadoopFileSystem is used when FileIO takes a URI with the {{hdfs://}} scheme.
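A sketch of that path, assuming beam-sdks-java-io-hadoop-file-system is on the classpath; the namenode address and file pattern are illustrative:

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsReadSketch {
  public static void main(String[] args) {
    // HadoopFileSystemOptions carries the Hadoop Configuration under which
    // Beam registers a handler for the hdfs:// scheme.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    // The hdfs:// URI routes this read through Beam's HadoopFileSystem.
    p.apply(TextIO.read().from("hdfs://namenode:8020/path/to/input-*.txt"));
    p.run().waitUntilFinish();
  }
}
```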



> Hadoop Client version 2.8 from 2.7
> ----------------------------------
>
>                 Key: BEAM-8822
>                 URL: https://issues.apache.org/jira/browse/BEAM-8822
>             Project: Beam
>          Issue Type: Bug
>          Components: build-system
>            Reporter: Tomo Suzuki
>            Assignee: Tomo Suzuki
>            Priority: Major
>         Attachments: OGuVu0A18jJ.png
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> [~iemejia] says:
> bq. probably a quicker way forward is to unblock the bigtable issue is to 
> move our Hadoop dependency to Hadoop 2.8 given that Hadoop 2.7 is now EOL we 
> have a good reason to do so 
> https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches
> The URL says
> {quote}Following branches are EOL: 
> [2.0.x - 2.7.x]{quote}
> https://issues.apache.org/jira/browse/BEAM-8569?focusedCommentId=16980532&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16980532
> About compatibility with other libraries:
> Hadoop client 2.7 is not compatible with Guava > 21 because of 
> Objects.toStringHelper. Fortunately Hadoop client 2.8 removed the use of the 
> method 
> ([detail|https://github.com/GoogleCloudPlatform/cloud-opensource-java/issues/1028#issuecomment-557709027]).
> 2.8.5 is the latest in 2.8.X.
>  !OGuVu0A18jJ.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
