[jira] [Updated] (IMPALA-733) Improve Parquet error handling for low disk space

Tim Armstrong (Jira) Wed, 23 Dec 2020 14:30:04 -0800


     [ https://issues.apache.org/jira/browse/IMPALA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Armstrong updated IMPALA-733:
---------------------------------
    Labels: supportability  (was: )

> Improve Parquet error handling for low disk space
> -------------------------------------------------
>
>                 Key: IMPALA-733
>                 URL: https://issues.apache.org/jira/browse/IMPALA-733
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 1.2.3
>         Environment: Less than 1GB free on the filesystem where HDFS resides.
>            Reporter: John Russell
>            Priority: Minor
>              Labels: supportability
>
> If HDFS has less than 1 GB free (or, I presume, whatever value is set in the 
> PARQUET_FILE_SIZE query option), an INSERT into a Parquet table fails even for 
> tiny amounts of data. That might be unavoidable, but the error should be 
> communicated more clearly to the user.
> INSERT ... VALUES reports that N rows were inserted (no error at all), but 
> the expected data is missing when the table is queried.
> INSERT ... SELECT gives a cryptic error message but still reports that the 
> rows were inserted, although they weren't.
> Repro:
> About 400MB free. (This is a VM that keeps getting filled up by 
> Impala-related logs.)
> $ df -k .
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/vda1             24607156  23961976    395184  99% /
> I was going to answer a question on the mailing list by showing an INSERT 
> going from an unpartitioned to a partitioned table.
> [localhost:21000] > create table unpart (year int, s string) stored as parquet;
> Query: create table unpart (year int, s string) stored as parquet
> Returned 0 row(s) in 0.12s
> INSERT ... VALUES looks like it succeeds, but the data isn't really there.
> [localhost:21000] > insert into unpart values (2013,'Happy'),(2014,'New Year');
> Query: insert into unpart values (2013,'Happy'),(2014,'New Year')
> Inserted 2 rows in 0.22s
> [localhost:21000] > select * from unpart;
> Query: select * from unpart
> Returned 0 row(s) in 0.22s
> [localhost:21000] > select * from unpart;
> Query: select * from unpart
> Returned 0 row(s) in 0.22s
> When copying the data out of a text table, an error is reported, but it 
> doesn't specifically say "out of space". The "Inserted 2 rows" message raises 
> the hope that the data made it in, but it didn't.
> [localhost:21000] > insert into unpart select * from t1;
> Query: insert into unpart select * from t1
> ERRORS ENCOUNTERED DURING EXECUTION: Backend 0:Failed to close HDFS file: 
> hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/.impala_insert_staging/284cf98f761aec95_5712ef093b357195//.2903970254304242837-6274340053807624598_1840160694_dir/2903970254304242837-6274340053807624598_1083629803_data.0
> Error(255): Unknown error 255
> Inserted 2 rows in 0.34s
> [localhost:21000] > select * from unpart;
> Query: select * from unpart
> Returned 0 row(s) in 0.22s
> After all this, the data directory contains a leftover staging subdirectory 
> (empty) and a zero-byte data file:
> $ hdfs dfs -ls hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart
> Found 2 items
> drwxrwxrwx   - impala supergroup          0 2014-01-08 11:39 hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/.impala_insert_staging
> -rw-r--r--   1 impala supergroup          0 2014-01-08 11:39 hdfs://127.0.0.1:8020/user/hive/warehouse/partitioning.db/unpart/3188829493227009611-3605612775229973420_1967882694_data.0
> Suggestions:
> - Make INSERT ... VALUES detect/report the HDFS error trying to write the 
> block. Don't report number of rows inserted.
> - Make INSERT ... SELECT error clearer, either suggest it could be 
> out-of-space or do some followup check for $(PARQUET_FILE_SIZE) space free. 
> Don't report number of rows inserted.
> - Be cleaner about leftover staging directories and empty data files. 
> (Shouldn't the data file stay in the staging directory until it's 
> successfully closed?)
> - Do whatever distributed checking is needed so that the error is handled 
> when a remote node runs out of space, rather than the coordinator node as in 
> this single-VM case.
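
The suggestions above could be sketched roughly as follows. This is only an
illustration on a local filesystem, not Impala's backend code or HDFS API: the
names staged_insert, InsertFailed, required_bytes and free_bytes are
hypothetical, and required_bytes merely stands in for whatever threshold (such
as PARQUET_FILE_SIZE) the real check would use. The idea is to check free
space up front, write into a staging directory, publish the file only after a
clean close, and raise an error instead of reporting a row count when anything
fails.

```python
import os
import shutil
import tempfile


class InsertFailed(Exception):
    """Raised instead of reporting a row count when the write did not complete."""


def staged_insert(table_dir, file_name, rows, required_bytes, free_bytes=None):
    """Write rows to a staging file and publish it only after a clean close.

    Hypothetical helper: returns the number of rows written, or raises
    InsertFailed so the caller never prints "Inserted N rows" for missing data.
    """
    # Pre-flight check: fail early with a specific "out of space" message.
    if free_bytes is None:
        free_bytes = shutil.disk_usage(table_dir).free
    if free_bytes < required_bytes:
        raise InsertFailed(
            f"out of space: {free_bytes} bytes free, {required_bytes} needed"
        )

    # Stage the file in a subdirectory of the table directory so the final
    # rename stays on the same filesystem.
    staging_dir = tempfile.mkdtemp(prefix=".insert_staging_", dir=table_dir)
    staged_path = os.path.join(staging_dir, file_name)
    count = 0
    try:
        with open(staged_path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(",".join(str(col) for col in row) + "\n")
                count += 1
            f.flush()
            os.fsync(f.fileno())
        # Publish only after the file has been flushed and closed cleanly.
        os.replace(staged_path, os.path.join(table_dir, file_name))
    except OSError as e:
        raise InsertFailed(f"write failed: {e}") from e
    finally:
        # Remove the staging directory (and any partial file) in every case.
        shutil.rmtree(staging_dir, ignore_errors=True)
    return count
```

With the numbers from the repro (395184 1K-blocks available, about 386 MiB),
calling staged_insert with required_bytes set to 1 GiB would raise
InsertFailed before anything is written, leaving no staging directory or
zero-byte data file behind.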



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
