[
https://issues.apache.org/jira/browse/ASTERIXDB-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404652#comment-17404652
]
Ingo Müller commented on ASTERIXDB-2948:
----------------------------------------
You can find the query
[here|https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-8/query.sqlpp]
and a tiny sample of the data
[here|https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/tree/master/data].
(The JSON file and the Parquet file should represent the same data; both have
1k outer-level records. I ran into the problem with 53M records in a Parquet
file.)
This is the output of {{ulimit -a}}:
{noformat}
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62162
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
{noformat}
I guess a limit of 1024 open files is too strict, and AsterixDB is presumably
not expected to work under it. I am surprised that this is the default value
on Amazon Linux, though.
I'll try to run my experiments with an increased ulimit and report again.
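For anyone watching the same symptom: below is a minimal Java sketch (not
AsterixDB code; the class name {{FdCheck}} is made up) for checking how close
the JVM process is to its descriptor limit. It assumes a HotSpot/OpenJDK build
on a Unix-like OS, where the platform MXBean can be cast to
{{com.sun.management.UnixOperatingSystemMXBean}}:
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

// Hypothetical helper (not part of AsterixDB): prints the JVM's current and
// maximum file descriptor counts, i.e. the process-level view of "ulimit -n".
public class FdCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("fd counts not exposed by this JVM/OS");
        }
    }
}
{code}
Running this periodically during the failing query should show whether the
count climbs toward the 1024 ceiling reported by {{ulimit -a}} above.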
> "Too many open files" on large data sets in Parquet/S3
> ------------------------------------------------------
>
> Key: ASTERIXDB-2948
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-2948
> Project: Apache AsterixDB
> Issue Type: Bug
> Components: EXT - External data
> Affects Versions: 0.9.8
> Reporter: Ingo Müller
> Priority: Major
>
> When I run complex queries on a very large machine (96 vCPUs, 48 configured
> IO devices/partitions) with Parquet files on S3, I occasionally get the
> following error:
> {noformat}
> java.io.FileNotFoundException: /data/asterixdb/iodevice40/./ExternalSortGroupByRunGenerator13134601214093461962.waf (Too many open files)
> {noformat}
> This only happens beyond a certain data size; I think the smallest instance of
> the data set where I observed the error was around 0.5 TB. I have not been able
> to test these queries against files on HDFS or the local filesystem, since the
> data does not fit on the system's disk.
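For context on the quoted error: a small, self-contained Java sketch
(illustrative only, not how AsterixDB's run generator actually works; the
class name {{FdExhaustion}} and the {{"runfile"}}/{{".waf"}} names are made up
to echo the message above) showing how descriptor exhaustion surfaces on Linux
as a {{FileNotFoundException}} with "(Too many open files)":
{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative repro of the failure mode: keep opening files without closing
// them until the per-process descriptor limit ("ulimit -n") is exhausted.
public class FdExhaustion {
    public static void main(String[] args) throws IOException {
        List<FileOutputStream> open = new ArrayList<>();
        try {
            while (true) {
                File f = File.createTempFile("runfile", ".waf");
                f.deleteOnExit();
                // Each unclosed stream pins one file descriptor.
                open.add(new FileOutputStream(f));
            }
        } catch (IOException e) {
            // On Linux this is typically reported as
            // java.io.FileNotFoundException: ... (Too many open files)
            System.out.println("failed after " + open.size()
                    + " open streams: " + e);
        } finally {
            for (FileOutputStream s : open) {
                s.close();
            }
        }
    }
}
{code}
With 48 IO devices and many parallel sort/group-by operators each holding run
files plus sockets to S3, a default limit of 1024 descriptors can plausibly be
exhausted well before the data itself is the bottleneck.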