[jira] [Commented] (PARQUET-2026) Allow empty row in parquet file

Vitalii Diravka (Jira) Thu, 22 Apr 2021 13:29:08 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17329398#comment-17329398
 ]


Vitalii Diravka commented on PARQUET-2026:
------------------------------------------

[~gszadovszky] Thanks for checking this. Drill can store empty tables (schema 
without data) in formats other than Parquet. One of Drill's features is that 
CTAS can write different formats, so a user can choose the CTAS format when 
first starting to use Drill, and all tables will then be created in that 
format. 
Some formats, such as CSV and JSON, have difficulties with certain CTAS queries, 
and Parquet clearly wins in most cases, so it is used as the default format 
for Drill.
With the new change, CTAS with _limit 0_ is no longer possible for the Parquet 
format.
Empty tables could still be created using other file formats while in Parquet 
CTAS mode, but that would be a hybrid mode rather than tables made up purely of 
Parquet files. And since Drill is not a regular DB, it has no such hybrid mode 
for creating tables in different formats, where the main aim is simply to 
create the table successfully.

Therefore, from the Drill perspective, it would be great to be able to create 
empty Parquet files and have them recognized as valid, possibly by passing an 
explicit flag to the _endBlock()_ method. 
My subjective point of view: I think there are many real-world cases where 
only the schema is present and that information is still valuable. I think 
Parquet should be able to handle this kind of data.
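
To make the explicit-flag idea concrete, here is a minimal sketch. It does not 
use the parquet-mr API: SketchBlockWriter, its fields, and the allowEmpty 
parameter are hypothetical names chosen for illustration, and the real 
_endBlock()_ may well look different. The assumption is only that the strict 
path rejects a block with zero rows, while an opt-in overload accepts it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a row-group writer. This is NOT the parquet-mr
// ParquetFileWriter; it only models the row-count bookkeeping around endBlock.
class SketchBlockWriter {
    // Row counts of the blocks that were closed successfully.
    private final List<Long> blockRowCounts = new ArrayList<>();
    // Row count of the currently open block; -1 means no block is open.
    private long currentRowCount = -1;

    void startBlock(long rowCount) {
        if (currentRowCount >= 0) {
            throw new IllegalStateException("a block is already open");
        }
        if (rowCount < 0) {
            throw new IllegalArgumentException("row count must not be negative");
        }
        currentRowCount = rowCount;
    }

    // Strict variant: a block with zero rows is rejected.
    void endBlock() {
        endBlock(false);
    }

    // Assumed overload: the caller opts in to closing a zero-row block.
    void endBlock(boolean allowEmpty) {
        if (currentRowCount < 0) {
            throw new IllegalStateException("no block is open");
        }
        if (currentRowCount == 0 && !allowEmpty) {
            // The block stays open so the caller can retry with allowEmpty = true.
            throw new IllegalStateException("end block with zero rows");
        }
        blockRowCounts.add(currentRowCount);
        currentRowCount = -1;
    }

    // Defensive copy of the recorded row counts.
    List<Long> blockRowCounts() {
        return new ArrayList<>(blockRowCounts);
    }
}
```

With this shape, existing callers of the no-argument method keep the strict 
behavior, while a tool such as Drill could call endBlock(true) when a CTAS 
query produces no rows.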

> Allow empty row in parquet file
> -------------------------------
>
>                 Key: PARQUET-2026
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2026
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Vitalii Diravka
>            Priority: Major
>              Labels: Drill, empty-file
>             Fix For: 1.13.0
>
>         Attachments: Screenshot from 2021-04-13 08-52-56.png
>
>
> Since PARQUET-1851, Parquet files that have a schema (meta information) but 
> 0 rows, aka empty files, are no longer written.
> As a result, empty tables can no longer be stored in Drill using Parquet 
> files, for example:
> {code:java}
> CREATE TABLE dfs.tmp.%s AS SELECT * FROM cp.`employee.json` WHERE 1=0{code}
> {code:java}
> CREATE TABLE dfs.tmp.%s AS select * from 
> dfs.`parquet/alltypes_required.parquet` where `col_int` = 0{code}
> {code:java}
> create table dfs.tmp.%s as select * from 
> dfs.`parquet/empty/complex/empty_complex.parquet`{code}
> So PARQUET-1851 breaks the following test cases:
> {code:java}
> TestUntypedNull.testParquetTableCreation   
> TestParquetWriterEmptyFiles.testComplexEmptyFileSchema   
> TestParquetWriterEmptyFiles.testWriteEmptyFile   
> TestParquetWriterEmptyFiles.testWriteEmptyFileWithSchema   
> TestParquetWriterEmptyFiles.testWriteEmptySchemaChange 
> TestMetastoreCommands.testAnalyzeEmptyRequiredParquetTable  
> TestMetastoreCommands.testSelectEmptyRequiredParquetTable{code}
>  I suggest emitting a warning when an empty Parquet file is created, or 
> adding an alternative _endBlock_ for backward compatibility with other tools:
> !Screenshot from 2021-04-13 08-52-56.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
