[ 
https://issues.apache.org/jira/browse/PARQUET-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330424#comment-17330424
 ] 

Gabor Szadovszky commented on PARQUET-2026:
-------------------------------------------

[~vitalii], thanks for the explanation. I still think that an empty table 
should not require an empty parquet file to be created. That said, I am not 
against allowing empty parquet files to be created, but we have to investigate 
this carefully. Does the format itself logically allow an empty file? 
E.g. what should be the accepted values for the data/dictionary page offsets? 
(These are required fields.) If we think the format allows this, we should 
write proper unit tests in parquet-mr to ensure we can handle empty files in 
every scenario and with every binding. Even though this is a regression, we 
could not catch it because we did not have any unit tests for it. I think the 
ability to create empty files was more a hidden feature than an intentional 
one. If we re-introduce this feature, we should do it properly.

> Allow empty row in parquet file
> -------------------------------
>
>                 Key: PARQUET-2026
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2026
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Vitalii Diravka
>            Priority: Major
>              Labels: Drill, empty-file
>             Fix For: 1.13.0
>
>         Attachments: Screenshot from 2021-04-13 08-52-56.png
>
>
> Since PARQUET-1851, parquet-mr no longer writes parquet files that have a 
> schema (meta information) but 0 rows, aka empty files.
> As a result, empty tables can no longer be stored in Drill as parquet files, 
> for example:
> {code:sql}
> CREATE TABLE dfs.tmp.%s AS SELECT * FROM cp.`employee.json` WHERE 1=0{code}
> {code:sql}
> CREATE TABLE dfs.tmp.%s AS select * from 
> dfs.`parquet/alltypes_required.parquet` where `col_int` = 0{code}
> {code:sql}
> create table dfs.tmp.%s as select * from 
> dfs.`parquet/empty/complex/empty_complex.parquet`{code}
> So PARQUET-1851 breaks the following test cases:
> {code:java}
> TestUntypedNull.testParquetTableCreation   
> TestParquetWriterEmptyFiles.testComplexEmptyFileSchema   
> TestParquetWriterEmptyFiles.testWriteEmptyFile   
> TestParquetWriterEmptyFiles.testWriteEmptyFileWithSchema   
> TestParquetWriterEmptyFiles.testWriteEmptySchemaChange 
> TestMetastoreCommands.testAnalyzeEmptyRequiredParquetTable  
> TestMetastoreCommands.testSelectEmptyRequiredParquetTable{code}
> I suggest logging a warning when an empty parquet file is created, or 
> adding an alternative _endBlock_ for backward compatibility with other tools:
> !Screenshot from 2021-04-13 08-52-56.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
