[
https://issues.apache.org/jira/browse/HUDI-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-2083:
--------------------------------------
Description:
Hudi CLI gives an exception when trying to connect to an S3 path:
{code:java}
create --path s3://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
Failed to get instance of org.apache.hadoop.fs.FileSystem
org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:98)
=========
create --path s3a://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
Command failed java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
{code}
This could be because the target/lib folder does not contain the hadoop-aws or AWS S3 SDK dependency.
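One quick way to confirm this hypothesis (assuming the standard Maven build layout, where the CLI's runtime jars land under hudi-cli/target/lib) is to check whether the AWS jars made it into that folder:
{code:java}
# Hypothetical check from the Hudi source root after `mvn package`;
# no output would mean hadoop-aws / the AWS SDK are missing from the CLI classpath.
ls hudi-cli/target/lib | grep -E 'hadoop-aws|aws-java-sdk'
{code}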
Update from Sivabalan:
Something that works for me even without the linked patch, if you wish to use the latest master hudi-cli with an S3 dataset. Just in case someone wants to try it out:
1. Replace the local hudi-cli.sh contents with [this|https://gist.github.com/nsivabalan/a31d56891353fe84413951972484f21f].
2. Run mvn package.
3. Tar the entire hudi-cli directory.
4. Copy it to the EMR master node.
5. Untar hudi-cli.tar.
6. Set SPARK_HOME to /usr/lib/spark.
7. Download the AWS jars into a directory and point CLIENT_JARS at them:
mkdir client_jars && cd client_jars
export HADOOP_VERSION=3.2.0
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
export AWS_SDK_VERSION=1.11.375
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar
export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar
8. Then launch hudi-cli.sh (see the consolidated sketch below).
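A minimal end-to-end sketch of steps 5-8 on the EMR master node; the tarball name, working directory, and jar locations are assumptions based on the steps above:
{code:java}
# Assumes hudi-cli.tar was copied to the hadoop user's home directory and the
# AWS jars were downloaded into /home/hadoop/client_jars as in step 7.
tar -xf hudi-cli.tar && cd hudi-cli
export SPARK_HOME=/usr/lib/spark
export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar
./hudi-cli.sh
{code}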
I verified that CLI commands that launch Spark succeed with this for an S3 dataset.
With the patch from Vinay, I am running into EMR FS issues.
Ethan: locally running hudi-cli with an S3 Hudi table:
{code:java}
# Build Hudi with the corresponding Spark version first
export AWS_REGION=us-east-2
export AWS_ACCESS_KEY_ID=<key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
export SPARK_HOME=<spark_home>
# Note: AWS jar versions below are specific to Spark 3.2.0
export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
./hudi-cli/hudi-cli.sh{code}
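Once the CLI launches with the AWS jars on the classpath, connecting to the S3 table should work; a hypothetical session (the bucket and path are illustrative, matching the example above):
{code:java}
connect --path s3a://some-bucket/tmp/hudi/test_mor
desc
{code}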
was:
Hudi CLI gives an exception when trying to connect to an S3 path:
{code:java}
create --path s3://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
Failed to get instance of org.apache.hadoop.fs.FileSystem
org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:98)
=========
create --path s3a://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
Command failed java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
{code}
This could be because the target/lib folder does not contain the hadoop-aws or AWS S3 SDK dependency.
Update from Sivabalan:
Something that works for me even without the linked patch, if you wish to use the latest master hudi-cli with an S3 dataset. Just in case someone wants to try it out:
1. Replace the local hudi-cli.sh contents with [this|https://gist.github.com/nsivabalan/a31d56891353fe84413951972484f21f].
2. Run mvn package.
3. Tar the entire hudi-cli directory.
4. Copy it to the EMR master node.
5. Untar and execute hudi-cli.sh.
I verified that CLI commands that launch Spark succeed with this for an S3 dataset.
With the patch from Vinay, I am running into EMR FS issues.
Ethan: locally running hudi-cli with an S3 Hudi table:
{code:java}
# Build Hudi with the corresponding Spark version first
export AWS_REGION=us-east-2
export AWS_ACCESS_KEY_ID=<key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
export SPARK_HOME=<spark_home>
# Note: AWS jar versions below are specific to Spark 3.2.0
export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
./hudi-cli/hudi-cli.sh{code}
> Hudi CLI does not work with S3
> ------------------------------
>
> Key: HUDI-2083
> URL: https://issues.apache.org/jira/browse/HUDI-2083
> Project: Apache Hudi
> Issue Type: Task
> Components: cli
> Reporter: Vinay
> Assignee: Vinay
> Priority: Major
> Labels: pull-request-available, query-eng, sev:high
>
> Hudi CLI gives an exception when trying to connect to an S3 path:
> {code:java}
> create --path s3://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
> Failed to get instance of org.apache.hadoop.fs.FileSystem
> org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
> at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:98)
> =========
> create --path s3a://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
> Command failed java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
> {code}
> This could be because the target/lib folder does not contain the hadoop-aws or AWS S3 SDK dependency.
>
> Update from Sivabalan:
> Something that works for me even without the linked patch, if you wish to use the latest master hudi-cli with an S3 dataset. Just in case someone wants to try it out:
> 1. Replace the local hudi-cli.sh contents with [this|https://gist.github.com/nsivabalan/a31d56891353fe84413951972484f21f].
> 2. Run mvn package.
> 3. Tar the entire hudi-cli directory.
> 4. Copy it to the EMR master node.
> 5. Untar hudi-cli.tar.
> 6. Set SPARK_HOME to /usr/lib/spark.
> 7. Download the AWS jars into a directory and point CLIENT_JARS at them:
> mkdir client_jars && cd client_jars
> export HADOOP_VERSION=3.2.0
> wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
> export AWS_SDK_VERSION=1.11.375
> wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar
> export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar
> 8. Then launch hudi-cli.sh.
> I verified that CLI commands that launch Spark succeed with this for an S3 dataset.
> With the patch from Vinay, I am running into EMR FS issues.
>
> Ethan: locally running hudi-cli with an S3 Hudi table:
> {code:java}
> # Build Hudi with the corresponding Spark version first
> export AWS_REGION=us-east-2
> export AWS_ACCESS_KEY_ID=<key_id>
> export AWS_SECRET_ACCESS_KEY=<secret_key>
> export SPARK_HOME=<spark_home>
> # Note: AWS jar versions below are specific to Spark 3.2.0
> export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
> ./hudi-cli/hudi-cli.sh{code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)