[
https://issues.apache.org/jira/browse/BEAM-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986339#comment-16986339
]
Tomo Suzuki edited comment on BEAM-8822 at 12/2/19 11:05 PM:
-------------------------------------------------------------
There are two modules that use Hadoop client dependencies: hadoop-format and
hadoop-file-system.
Note that iemejia mentioned that "The hadoop modules (both hadoop-file-system
and hadoop-format) are provided by the user (commonly by relying on the ones
available in their clusters)". This means the hadoop_client, hadoop_common, and
hadoop_mapreduce_client_core dependencies are marked as {{provided}}:
{noformat}
provided library.java.hadoop_client
provided library.java.hadoop_common
provided library.java.hadoop_mapreduce_client_core
{noformat}
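Because these artifacts are {{provided}}, a user's own build has to supply the Hadoop jars at runtime. A minimal sketch of what that might look like in a user's Gradle build (the Beam and Hadoop version numbers here are illustrative, not taken from Beam's build):

```groovy
dependencies {
    implementation "org.apache.beam:beam-sdks-java-io-hadoop-file-system:2.16.0"
    // The Hadoop client jars are not pulled in transitively; the user supplies
    // the version matching their cluster, e.g.:
    runtimeOnly "org.apache.hadoop:hadoop-client:2.8.5"
}
```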
h1. sdks/java/io/hadoop-format
As per [Hadoop Input/Output Format
IO|https://beam.apache.org/documentation/io/built-in/hadoop/], HadoopFormatIO
(in the beam-sdks-java-io-hadoop-format artifact) does not just read files from
Hadoop; it also serves as the foundational class for connecting to other data
stores such as Cassandra, HBase, and even Elasticsearch.
Its integration test HadoopFormatIOIT uses PostgreSQL. Setting up a PostgreSQL
instance on a local MacBook and running HadoopFormatIOIT through IntelliJ worked
with the following options:
{noformat}
--tests
org.apache.beam.sdk.io.hadoop.format.HadoopFormatIOIT
-DintegrationTestPipelineOptions='[
"--postgresServerName=localhost",
"--postgresUsername=suztomo",
"--postgresDatabaseName=suztomo",
"--postgresPassword=",
"--postgresSsl=false",
"--numberOfRecords=1000"
]'{noformat}
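For reference, the read side of HadoopFormatIO is driven by a Hadoop {{Configuration}}. A hedged sketch of the wiring (the configuration key names follow the HadoopFormatIO documentation; the choice of TextInputFormat and the input path are illustrative assumptions, and running this requires a Beam runner and a reachable HDFS):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class HadoopFormatReadSketch {
  public static void main(String[] args) {
    // Tell HadoopFormatIO which InputFormat to instantiate and the
    // key/value classes that InputFormat produces.
    Configuration conf = new Configuration();
    conf.setClass("mapreduce.job.inputformat.class",
        TextInputFormat.class, Object.class);
    conf.setClass("key.class", LongWritable.class, Object.class);
    conf.setClass("value.class", Text.class, Object.class);
    // Assumed input location; any InputFormat-specific keys go here too.
    conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs://namenode/path");

    Pipeline p = Pipeline.create();
    PCollection<KV<LongWritable, Text>> rows =
        p.apply(HadoopFormatIO.<LongWritable, Text>read()
            .withConfiguration(conf));
    p.run().waitUntilFinish();
  }
}
```

Swapping the InputFormat class in the configuration is what lets the same transform front Cassandra, HBase, or Elasticsearch input formats.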
h1. sdks/java/io/hadoop-file-system
HadoopFileSystem is in the sdks/java/io/hadoop-file-system module. Its test
HadoopFileSystemTest creates a MiniDFSCluster (hadoop-hdfs artifact) and confirms
interaction with it by creating and reading files. Beam's HadoopFileSystem
class provides operations such as {{match}}, {{create}}, {{open}}, and {{copy}}.
My initial thought on testing the compatibility of the Hadoop dependency upgrade
is to check such communication between a new HDFS server and an old HDFS client.
But where is HadoopFileSystem used?
As per [A review of input streaming
connectors|https://beam.apache.org/blog/2018/08/20/review-input-streaming-connectors.html],
HadoopFileSystem is used when FileIO takes a URL with the {{hdfs://}} scheme.
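To illustrate that path, a hedged sketch of a pipeline that exercises HadoopFileSystem through a file-based IO (the namenode address and input path are assumptions; running this requires the hadoop-file-system module on the classpath and a reachable HDFS):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;
import java.util.Collections;

public class HdfsReadSketch {
  public static void main(String[] args) {
    // Register a Hadoop Configuration so that the hdfs:// scheme is
    // resolved by Beam's HadoopFileSystem.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed address
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    // TextIO delegates to HadoopFileSystem because of the hdfs:// scheme.
    p.apply(TextIO.read().from("hdfs://namenode:8020/path/to/input-*.txt"));
    p.run().waitUntilFinish();
  }
}
```

A compatibility test along the lines above would pin the hadoop-client version in this pipeline while pointing it at a MiniDFSCluster built from a different Hadoop version.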
> Hadoop Client version 2.8 from 2.7
> ----------------------------------
>
> Key: BEAM-8822
> URL: https://issues.apache.org/jira/browse/BEAM-8822
> Project: Beam
> Issue Type: Bug
> Components: build-system
> Reporter: Tomo Suzuki
> Assignee: Tomo Suzuki
> Priority: Major
> Attachments: OGuVu0A18jJ.png
>
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> [~iemejia] says:
> bq. probably a quicker way forward is to unblock the bigtable issue is to
> move our Hadoop dependency to Hadoop 2.8 given that Hadoop 2.7 is now EOL we
> have a good reason to do so
> https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches
> The URL says
> {quote}Following branches are EOL:
> [2.0.x - 2.7.x]{quote}
> https://issues.apache.org/jira/browse/BEAM-8569?focusedCommentId=16980532&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16980532
> About compatibility with other library:
> Hadoop client 2.7 is not compatible with Guava > 21 because of
> Objects.toStringHelper. Fortunately Hadoop client 2.8 removed the use of the
> method
> ([detail|https://github.com/GoogleCloudPlatform/cloud-opensource-java/issues/1028#issuecomment-557709027]).
> 2.8.5 is the latest in 2.8.X.
> !OGuVu0A18jJ.png!