[ 
https://issues.apache.org/jira/browse/HADOOP-18839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Martynov updated HADOOP-18839:
------------------------------------
    Description: 
I tried to connect from PySpark to MinIO running in Docker.

Installing PySpark and starting MinIO:
{code:bash}
pip install pyspark==3.4.1

docker run --rm -d --hostname minio --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ACCESS_KEY=access \
  -e MINIO_SECRET_KEY=Eevoh2wo0ui6ech0wu8oy3feiR3eicha \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD=iepaegaigi3ofa9TaephieSo1iecaesh \
  bitnami/minio:latest
docker exec minio mc mb test-bucket
{code}
Then create a Spark session:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder\
          .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")\
          .config("spark.hadoop.fs.s3a.endpoint", "localhost:9000")\
          .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")\
          .config("spark.hadoop.fs.s3a.path.style.access", "true")\
          .config("spark.hadoop.fs.s3a.access.key", "access")\
          .config("spark.hadoop.fs.s3a.secret.key", "Eevoh2wo0ui6ech0wu8oy3feiR3eicha")\
          .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")\
          .getOrCreate()
spark.sparkContext.setLogLevel("debug")
{code}
Then try to access an object in the bucket:
{code:python}
import time

begin = time.perf_counter()
spark.read.format("csv").load("s3a://test-bucket/fake")
end = time.perf_counter()

py4j.protocol.Py4JJavaError: An error occurred while calling o40.load.
: org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
s3a://test-bucket/fake: com.amazonaws.SdkClientException: Unable to execute 
HTTP request: Unsupported or unrecognized SSL message: Unable to execute HTTP 
request: Unsupported or unrecognized SSL message
...
{code}
[^ssl.log]
{code:python}
>>> print((end-begin)/60, "min")
14.72387898775002 min
{code}
I waited almost *15 minutes* to get the exception from Spark. The reason was that I tried to connect to the endpoint with {{fs.s3a.connection.ssl.enabled=true}}, but MinIO is configured to listen for plain HTTP only.

Is there any way to raise the exception immediately if an SSL connection cannot be established?
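To illustrate why failing fast seems feasible here: when the server speaks plain HTTP, the TLS handshake fails on the very first round trip, so the error is not a transient one that retrying could fix. A standalone sketch (pure Python with a toy plain-text server — nothing here is Hadoop, AWS SDK, or MinIO code):
{code:python}
import socket
import ssl
import threading

# Toy server that answers the TLS ClientHello with plain HTTP,
# mimicking an HTTP-only endpoint contacted over https.
def plain_http_server(srv):
    conn, _ = srv.accept()
    conn.recv(1024)  # the client's ClientHello lands here
    conn.sendall(b"HTTP/1.1 400 Bad Request\r\n\r\n")  # plain-text reply
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=plain_http_server, args=(srv,), daemon=True).start()

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

err = None
try:
    with socket.create_connection(("127.0.0.1", port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname="localhost"):
            pass
except ssl.SSLError as e:
    err = e

print("handshake failed on the first attempt:", err)
{code}
The handshake raises {{ssl.SSLError}} on the first attempt; there is nothing to wait 15 minutes for.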


If I pass a wrong endpoint, like {{localhos:9000}}, I get an exception like this in just 5 seconds:
{code:java}
: org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
s3a://test-bucket/fake: com.amazonaws.SdkClientException: Unable to execute 
HTTP request: test-bucket.localhos: Unable to execute HTTP request: 
test-bucket.localhos
...
{code}
[^host.log]
{code:python}
>>> print(end-begin, "sec")
5.700424307000503 sec
{code}
I know about options like {{fs.s3a.attempts.maximum}} and {{fs.s3a.retry.limit}}; setting them to 1 causes the exception to be raised almost immediately. But this does not look right.
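For completeness, that workaround looks like the sketch below (same endpoint and credentials as above). Note that it disables S3A retries for *every* kind of failure, not just this unrecoverable TLS mismatch, which is why it does not feel like a proper fix:
{code:python}
from pyspark.sql import SparkSession

# Workaround sketch: a single attempt and no retries, so the first
# failure (SSL or otherwise) is raised immediately.
spark = SparkSession.builder\
          .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")\
          .config("spark.hadoop.fs.s3a.endpoint", "localhost:9000")\
          .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")\
          .config("spark.hadoop.fs.s3a.attempts.maximum", "1")\
          .config("spark.hadoop.fs.s3a.retry.limit", "1")\
          .getOrCreate()
{code}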



> s3a client SSLException is raised after very long timeout "Unsupported or 
> unrecognized SSL message"
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18839
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18839
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.4
>            Reporter: Maxim Martynov
>            Priority: Minor
>         Attachments: host.log, ssl.log
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
