[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated PARQUET-1361:
    Fix Version/s:     (was: cpp-7.0.0)

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
> ---
>
>                 Key: PARQUET-1361
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1361
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp, parquet-mr
>    Affects Versions: cpp-1.4.0
>            Reporter: Ken Terada
>            Priority: Major
>         Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, sample_w_null.csv
>
>
> The parquet-cpp v1.4.1 library allows generation of parquet files with NULL values for INT type columns which causes unexpected parsing errors in downstream systems ingesting those files.
> e.g.,
> {{Error parsing the parquet file: UNKNOWN can not be applied to a primitive type}}
>
> *+Reproduction Steps+*
> OS: CentOS 7.5.1804
> Python: 3.4.8
>
> +Prerequisites:+
> * Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, {{PyArrow: 0.9.0}}
>
> +Step 1+
> Generate the parquet file.
> {{sample_w_null.csv}}
> {code}
> col1,col2,col3,col4,col5
> 1,2,,4,5
> {code}
> {{parquet-1361-repro-1.py}}
> {code}
> #!/usr/bin/python
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> input_file = 'sample_w_null.csv'
> output_file = 'int_unknown.parquet'
> p_schema = {'col1': np.int32,
>             'col2': np.int32,
>             'col3': np.unicode_,
>             'col4': np.int32,
>             'col5': np.int32}
> df = pd.read_csv(input_file, dtype=p_schema)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, output_file)
> {code}
>
> +Step 2+
> Inspect the metadata of the generated file.
> {{parquet-1361-repro-2.py}}
> {code}
> #!/usr/bin/python
> import pyarrow.parquet as pq
> for filename in ['int_unknown.parquet']:
>     pq_file = pq.ParquetFile(filename)
>     print(pq_file.metadata)
>     print(pq_file.schema)
>     print(pq_file.num_row_groups)
>     print(pq.read_table(filename,
>                         columns=['col1','col2','col3','col4','col5']).to_pandas())
> {code}
> Results
> {code}
>
> created_by: parquet-cpp version 1.4.1-SNAPSHOT
> num_columns: 6
> num_rows: 1
> num_row_groups: 1
> format_version: 1.0
> serialized_size: 1434
>
> col1: INT32
> col2: INT32
> col3: INT32 UNKNOWN
> col4: INT32
> col5: INT32
> __index_level_0__: INT64
> 1
>    col1  col2  col3  col4  col5
> 0     1     2  None     4     5
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
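
The report does not say why col3 comes out as {{INT32 UNKNOWN}}. One plausible reading, offered here as an assumption only, is that col3 has no value in any row of {{sample_w_null.csv}}, so the writer has nothing from which to infer a concrete type. The sketch below is a pre-flight check that flags such columns before a file is written. It uses only the Python standard library, and the helper name {{find_all_empty_columns}} is hypothetical, not part of pandas, PyArrow or parquet-cpp.

```python
import csv
import io


def find_all_empty_columns(csv_text):
    """Return the names of columns whose every data cell is empty.

    Hypothetical helper for illustration; it does not touch Parquet at all.
    A file with a header but no data rows reports every column as empty.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    # Assume each column is empty until a non-blank cell is seen.
    empty = {name: True for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row.get(name)
            if value is not None and value.strip() != "":
                empty[name] = False
    return [name for name in fieldnames if empty[name]]


# Same content as sample_w_null.csv from the report.
sample = "col1,col2,col3,col4,col5\n1,2,,4,5\n"
print(find_all_empty_columns(sample))  # ['col3']
```

A caller could use the returned names to decide whether to drop those columns or to give them an explicit type before writing, depending on what the downstream reader accepts.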
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated PARQUET-1361:
    Component/s: parquet-mr
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated PARQUET-1361:
    Fix Version/s: cpp-7.0.0
                       (was: cpp-6.0.0)
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated PARQUET-1361:
    Fix Version/s: cpp-6.0.0
                       (was: cpp-5.0.0)
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated PARQUET-1361:
    Fix Version/s: cpp-5.0.0
                       (was: cpp-4.0.0)
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated PARQUET-1361:
    Fix Version/s: cpp-1.6.0
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Terada updated PARQUET-1361:
    Attachment: sample_w_null.csv
                parquet-1361-repro-2.py
                parquet-1361-repro-1.py
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Terada updated PARQUET-1361:
    Description: heading set in bold as *+Reproduction Steps+* (was: +Reproduction Steps+); the rest of the description text is unchanged.
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Terada updated PARQUET-1361:
    Description: step headings written as +Step 1+ and +Step 2+ (was: +Step 1: + and +Step 2:+); the rest of the description text is unchanged.
[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Terada updated PARQUET-1361:
    Description: same text as quoted in the first message of this thread, with the headings written as +Reproduction Steps+, +Step 1: + and +Step 2:+
    Summary: [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types  (was: [C++] 1.4.1 library allows creation of parquet file w/UNKNOWN INT data)