[jira] [Commented] (ARROW-14930) [Python] FileNotFound when using bucket+folders in S3 + partitioned parquet

Luis Morales (Jira) Wed, 08 Dec 2021 09:15:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455884#comment-17455884
 ]


Luis Morales commented on ARROW-14930:
--------------------------------------

I would say there is no problem on the server side. My thoughts on this:

 

*scality.get_file_info("dasynth/parquet/")* 

This method uses HEAD operations to ask for buckets or objects, but in 
this case dasynth/parquet is neither of them; it's just a prefix (or folder or 
tag... call it whatever you want). That's why the server answers 
with object not found.

 

When using FileSelector, it does not first ask whether the object exists; it 
just asks for the contents with a GET method, and in that case it works properly.
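
To illustrate the difference, here is a minimal sketch in plain Python. It does not use pyarrow or a real S3 server: FlatObjectStore, head and list_prefix are hypothetical names that only model the behaviour described above, and the part-0 file names are made up. The assumption is that the store keeps a flat set of keys, so a "folder" exists only as the common prefix of some keys.

```python
class FlatObjectStore:
    """Hypothetical in-memory stand-in for one bucket: a flat key -> bytes map."""

    def __init__(self, keys):
        self._objects = {key: b"" for key in keys}

    def head(self, key):
        """Model a HEAD request: it succeeds only for an exact object key."""
        if key not in self._objects:
            raise FileNotFoundError(key)
        return {"key": key, "size": len(self._objects[key])}

    def list_prefix(self, prefix):
        """Model a GET listing: return every key that starts with the prefix."""
        return sorted(k for k in self._objects if k.startswith(prefix))


# Made-up keys inside the bucket; no object is stored under "parquet/" itself.
store = FlatObjectStore([
    "parquet/taxies/2019/month_year=2001-01/payment_type=1/part-0.snappy.parquet",
    "parquet/taxies/2019/month_year=2001-01/payment_type=2/part-0.snappy.parquet",
])

# Listing by prefix works, because it never asks whether "parquet/" exists.
listed = store.list_prefix("parquet/")

# HEAD on the bare prefix fails, because no object has exactly that key.
try:
    store.head("parquet/")
    prefix_found = True
except FileNotFoundError:
    prefix_found = False
```

In this model a HEAD on a full object key still succeeds, whatever "folders" appear in the key.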

 

Maybe adding a new parameter, object_type = [bucket, object, tag], and applying 
different logic in each case:

bucket, object -> HEAD methods

tag -> the same logic as if it were using FileSelector.

 

would solve the problem.
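
A rough sketch of that idea in Python, only to make the proposal concrete. Nothing here is an existing pyarrow API: object_type is the parameter suggested above, and get_info, head and list_prefix are hypothetical names. head stands for whatever issues the HEAD request, and list_prefix for the listing that FileSelector triggers.

```python
from enum import Enum


class ObjectType(Enum):
    """The three kinds of path proposed above (hypothetical, not a pyarrow type)."""
    BUCKET = "bucket"
    OBJECT = "object"
    TAG = "tag"


def get_info(path, object_type, head, list_prefix):
    """Pick the request style from object_type.

    head and list_prefix are caller-supplied callables taking a path;
    they are placeholders for the real HEAD and listing requests.
    """
    if object_type in (ObjectType.BUCKET, ObjectType.OBJECT):
        # Buckets and objects exist on the server, so a HEAD can find them.
        return [head(path)]
    if object_type is ObjectType.TAG:
        # A tag (prefix) is not an object: list its contents instead.
        return list_prefix(path)
    raise ValueError(f"unknown object_type: {object_type!r}")
```

With this shape, a caller that knows "dasynth/parquet/" is only a prefix would pass ObjectType.TAG and never trigger the failing HEAD.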

 

Additionally, the dataset() method should be changed too, in line with 
this idea.

 

An additional example: if you use get_file_info with a file like this:

 

scality.get_file_info("dasynth/parquet/taxies/2019/month_year=2001-01/payment_type=1/9ccd9d4ae28a41e1acaf40ea594b61da.snappy.parquet")

 

it works, even though the path goes through the folders parquet, taxies, 2019...

> [Python] FileNotFound when using bucket+folders in S3 + partitioned parquet
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-14930
>                 URL: https://issues.apache.org/jira/browse/ARROW-14930
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.1
>         Environment: linux + python 3.8
>            Reporter: Luis Morales
>            Priority: Trivial
>             Fix For: 6.0.2
>
>
> When using dataset.Dataset with S3FileSystem against S3-compatible object 
> storage, I get a FileNotFoundError.
>  
> My code:
>  
> import pyarrow.dataset as ds
> from pyarrow import fs
>  
> scality = fs.S3FileSystem(access_key='accessKey1', 
> secret_key='verySecretKey1', endpoint_override="http://localhost:8000", 
> region="")
> data = ds.dataset("dasynth/parquet/taxies/2019_june/", format="parquet", 
> partitioning="hive", filesystem=scality)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
