[ https://issues.apache.org/jira/browse/ARROW-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459982#comment-17459982 ]

Luis Morales edited comment on ARROW-14930 at 12/15/21, 10:34 PM:
------------------------------------------------------------------

I've tried your tests, but with local storage instead of memory (and with SSL).

docker run -v /cto/ring/data:/usr/src/app/localData \
  -v /cto/ring/metadata:/usr/src/app/localMetadata \
  -p 8000:8000 -e SSL=TRUE -e ENDPOINT=s3.scality.test \
  -e LOG_LEVEL=trace -e REMOTE_MANAGEMENT_DISABLE=1 \
  zenko/cloudserver

I create a bucket with s3cmd:

s3cmd mb s3://bucket

 

Then I put a file under a nested key ("folder"):

s3cmd put hola.txt s3://bucket/foo/bar.txt

 

Now with pyarrow:

from pyarrow import fs

scality = fs.S3FileSystem(access_key='accessKey1', secret_key='verySecretKey1',
                          endpoint_override="https://s3.scality.test:8000", scheme='https')

*scality.get_file_info(fs.FileSelector('/', recursive=True)). OK.*

[<FileInfo for 'bucket': type=FileType.Directory>,
<FileInfo for 'bucket/foo': type=FileType.Directory>,
<FileInfo for 'bucket/foo/bar.txt': type=FileType.File, size=11>]

 

*scality.get_file_info('bucket'). OK.*

<FileInfo for 'bucket': type=FileType.Directory>

 

*scality.get_file_info('bucket/foo'). OK.*

<FileInfo for 'bucket/foo': type=FileType.Directory>

 

I think this works because foo has a file directly inside it: the store only holds object keys, so "foo" exists merely as a prefix of 'foo/bar.txt' (see the boto3 check below). Let's try a more Parquet-oriented use case, where the top folder has no file of its own, only the subfolders that correspond to Hive-style partitions.
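To make that concrete, here is a quick way to dump the raw keys, sketched with boto3. The client settings are assumptions mirroring the setup above; verify=False is only there because my test endpoint uses a self-signed certificate.

{code:python}
import boto3

# Assumption: same test endpoint and credentials as the docker setup above.
s3 = boto3.client(
    's3',
    endpoint_url='https://s3.scality.test:8000',
    aws_access_key_id='accessKey1',
    aws_secret_access_key='verySecretKey1',
    verify=False,  # self-signed test certificate
)

# There is no 'foo/' directory object, only 'foo/bar.txt': the "directory"
# foo exists implicitly, as a key prefix.
resp = s3.list_objects_v2(Bucket='bucket')
print([obj['Key'] for obj in resp.get('Contents', [])])
{code}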

 

s3cmd put hola.txt s3://bucket/foo/bar/hola.txt

*scality.get_file_info(fs.FileSelector('bucket', recursive=True)). OK*

[<FileInfo for 'bucket/foo': type=FileType.Directory>,
<FileInfo for 'bucket/foo/bar': type=FileType.Directory>,
<FileInfo for 'bucket/foo/bar/hola.txt': type=FileType.File, size=11>]

 

{color:#ff0000}*scality.get_file_info('bucket/foo'). KO*{color}

{color:#ff0000}*<FileInfo for 'bucket/foo': type=FileType.NotFound>*{color}
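A cross-check that may help narrow this down (just a sketch, reusing the scality filesystem from above; allow_not_found is a real FileSelector option, but I have not verified how it behaves against this store):

{code:python}
# List under the very path that the direct lookup reports as NotFound.
# If this still returns the children, the prefix listing works and only
# the single-path ("stat"-like) lookup misbehaves.
selector = fs.FileSelector('bucket/foo', recursive=False, allow_not_found=True)
print(scality.get_file_info(selector))
{code}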

 

*scality.get_file_info('bucket/foo/bar'). OK*

<FileInfo for 'bucket/foo/bar': type=FileType.Directory>

 

The point is that, from a Parquet perspective, when I create a dataset I have to point it at the "foo" folder (the one that contains all the partitions), not at "bar" (which would be equivalent to one particular partition).

 
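One workaround that might be worth trying (a sketch, not a fix: it assumes the store accepts the zero-length "directory marker" object that Arrow's create_dir writes, which I have not verified against Scality):

{code:python}
# Write an explicit directory marker for 'bucket/foo', so the path exists
# as a real key instead of only as a prefix, then retry the lookup.
scality.create_dir('bucket/foo', recursive=True)
print(scality.get_file_info('bucket/foo'))

# If the lookup now reports a Directory, ds.dataset('bucket/foo', ...)
# should be able to resolve the base directory as well.
{code}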

 

 



> [C++][Python] FileNotFound with Scality accessed through S3 APIs
> ----------------------------------------------------------------
>
>                 Key: ARROW-14930
>                 URL: https://issues.apache.org/jira/browse/ARROW-14930
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 6.0.1
>         Environment: linux + python 3.8
>            Reporter: Luis Morales
>            Priority: Major
>              Labels: s3
>             Fix For: 6.0.2
>
>
> When using dataset.Dataset with S3FileSystem against an S3-compatible
> object store, I get a FileNotFoundError.
>  
> My code:
>  
> scality = fs.S3FileSystem(access_key='accessKey1',
> secret_key='verySecretKey1', endpoint_override="http://localhost:8000",
> region="")
> data = ds.dataset("dasynth/parquet/taxies/2019_june/", format="parquet", 
> partitioning="hive", filesystem=scality)


