[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2022-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1361:

Fix Version/s: (was: cpp-7.0.0)

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp, parquet-mr
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>
>
> The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
> values for INT type columns which causes unexpected parsing errors in 
> downstream systems ingesting those files.
> e.g.,
> {{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
> type}}
> *+Reproduction Steps+*
> OS: CentOS 7.5.1804
> Python: 3.4.8
> +Prerequisites:+
> * Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
> {{PyArrow: 0.9.0}}
> +Step 1+
> Generate the parquet file.
> {{sample_w_null.csv}}
> {code}
> col1,col2,col3,col4,col5
> 1,2,,4,5
> {code}
> {{parquet-1361-repro-1.py}}
> {code}
> #!/usr/bin/python
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> input_file = 'sample_w_null.csv'
> output_file = 'int_unknown.parquet'
> p_schema = {'col1': np.int32,
>             'col2': np.int32,
>             'col3': np.unicode_,
>             'col4': np.int32,
>             'col5': np.int32}
> df = pd.read_csv(input_file, dtype=p_schema)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, output_file)
> {code}
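
The only data row in sample_w_null.csv leaves col3 empty, so that column reaches Arrow with no non-null values, which appears to be what produces the INT32 UNKNOWN entry in the results further down. A minimal sketch of a pre-flight check that lists such columns before conversion is given here. It uses only the standard library, and the helper name all_empty_columns is hypothetical, not part of pandas or pyarrow.

```python
import csv
import io


def all_empty_columns(csv_text):
    """Return the names of columns that have no non-empty value in any row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    seen_value = {name: False for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row.get(name)
            # Short rows yield None for missing cells; treat those as empty too.
            if value is not None and value.strip() != "":
                seen_value[name] = True
    return [name for name in fieldnames if not seen_value[name]]


# Same content as sample_w_null.csv from the report.
sample = "col1,col2,col3,col4,col5\n1,2,,4,5\n"
print(all_empty_columns(sample))  # ['col3']
```

Columns reported by such a check could be dropped or given an explicit type before the table is written; which of the two is appropriate depends on the downstream reader.
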
> +Step 2+
> Inspect the metadata of the generated file.
> {{parquet-1361-repro-2.py}}
> {code}
> #!/usr/bin/python
> import pyarrow.parquet as pq
> for filename in ['int_unknown.parquet']:
>     pq_file = pq.ParquetFile(filename)
>     print(pq_file.metadata)
>     print(pq_file.schema)
>     print(pq_file.num_row_groups)
>     print(pq.read_table(filename,
>                         columns=['col1','col2','col3','col4','col5']).to_pandas())
> {code}
> Results
> {code}
> 
>   created_by: parquet-cpp version 1.4.1-SNAPSHOT
>   num_columns: 6
>   num_rows: 1
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 1434
> 
> col1: INT32
> col2: INT32
> col3: INT32 UNKNOWN
> col4: INT32
> col5: INT32
> __index_level_0__: INT64
> 1
>    col1  col2  col3  col4  col5
> 0     1     2  None     4     5
> {code}
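
The telltale sign in the output above is the UNKNOWN annotation on col3. As a rough sketch, a downstream job could scan the printed schema for that annotation before ingesting a file. This assumes the "name: TYPE ANNOTATION" line format shown in this report, which may differ in other pyarrow versions, and the helper name unknown_annotated_columns is hypothetical.

```python
def unknown_annotated_columns(schema_text):
    """Return column names whose printed type carries the UNKNOWN annotation."""
    flagged = []
    for line in schema_text.splitlines():
        name, sep, type_text = line.partition(":")
        if not sep:
            continue  # not a "name: TYPE" line
        # Match UNKNOWN as a whole token so a column merely named
        # "unknown" is not flagged.
        if "UNKNOWN" in type_text.split():
            flagged.append(name.strip())
    return flagged


# Schema lines copied from the results in the report.
printed_schema = """\
col1: INT32
col2: INT32
col3: INT32 UNKNOWN
col4: INT32
col5: INT32
__index_level_0__: INT64
"""
print(unknown_annotated_columns(printed_schema))  # ['col3']
```

This only inspects text, so it is a stopgap for triage rather than a substitute for fixing the writer.
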



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2022-01-03 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated PARQUET-1361:
-
Component/s: parquet-mr

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp, parquet-mr
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Fix For: cpp-7.0.0
>
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2021-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1361:

Fix Version/s: cpp-7.0.0
   (was: cpp-6.0.0)

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Fix For: cpp-7.0.0
>
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2021-08-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1361:

Fix Version/s: (was: cpp-5.0.0)
   cpp-6.0.0

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Fix For: cpp-6.0.0
>
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2021-05-27 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1361:

Fix Version/s: (was: cpp-4.0.0)
   cpp-5.0.0

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Fix For: cpp-5.0.0
>
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-11-12 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1361:
--
Fix Version/s: cpp-1.6.0

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Fix For: cpp-1.6.0
>
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-07-28 Thread Ken Terada (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Terada updated PARQUET-1361:

Attachment: sample_w_null.csv
parquet-1361-repro-2.py
parquet-1361-repro-1.py

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>


[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-07-28 Thread Ken Terada (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Terada updated PARQUET-1361:

Description: 
The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
values for INT type columns which causes unexpected parsing errors in 
downstream systems ingesting those files.

e.g.,
{{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
type}}

*+Reproduction Steps+*

OS: CentOS 7.5.1804
Python: 3.4.8

+Prerequisites:+
* Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
{{PyArrow: 0.9.0}}

+Step 1: +

Generate the parquet file.

{{sample_w_null.csv}}

{code}
col1,col2,col3,col4,col5
1,2,,4,5
{code}

{{parquet-1361-repro-1.py}}

{code}
#!/usr/bin/python

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

input_file = 'sample_w_null.csv'
output_file = 'int_unknown.parquet'
p_schema = {'col1': np.int32,
            'col2': np.int32,
            'col3': np.unicode_,
            'col4': np.int32,
            'col5': np.int32}

df = pd.read_csv(input_file, dtype=p_schema)
table = pa.Table.from_pandas(df)
pq.write_table(table, output_file)
{code}

+Step 2:+

Inspect the metadata of the generated file.

{{parquet-1361-repro-2.py}}

{code}
#!/usr/bin/python

import pyarrow.parquet as pq

for filename in ['int_unknown.parquet']:
    pq_file = pq.ParquetFile(filename)
    print(pq_file.metadata)
    print(pq_file.schema)
    print(pq_file.num_row_groups)
    print(pq.read_table(filename,
                        columns=['col1','col2','col3','col4','col5']).to_pandas())
{code}

Results

{code}

  created_by: parquet-cpp version 1.4.1-SNAPSHOT
  num_columns: 6
  num_rows: 1
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1434

col1: INT32
col2: INT32
col3: INT32 UNKNOWN
col4: INT32
col5: INT32
__index_level_0__: INT64

1
   col1  col2  col3  col4  col5
0     1     2  None     4     5
{code}

  was:
The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
values for INT type columns which causes unexpected parsing errors in 
downstream systems ingesting those files.

e.g.,
{{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
type}}

+Reproduction Steps+

OS: CentOS 7.5.1804
Python: 3.4.8

+Prerequisites:+
* Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
{{PyArrow: 0.9.0}}

+Step 1: +

Generate the parquet file.

{{sample_w_null.csv}}

{code}
col1,col2,col3,col4,col5
1,2,,4,5
{code}

{{parquet-1361-repro-1.py}}

{code}
#!/usr/bin/python

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

input_file = 'sample_w_null.csv'
output_file = 'int_unknown.parquet'
p_schema = {'col1': np.int32,
            'col2': np.int32,
            'col3': np.unicode_,
            'col4': np.int32,
            'col5': np.int32}

df = pd.read_csv(input_file, dtype=p_schema)
table = pa.Table.from_pandas(df)
pq.write_table(table, output_file)
{code}

+Step 2:+

Inspect the metadata of the generated file.

{{parquet-1361-repro-2.py}}

{code}
#!/usr/bin/python

import pyarrow.parquet as pq

for filename in ['int_unknown.parquet']:
    pq_file = pq.ParquetFile(filename)
    print(pq_file.metadata)
    print(pq_file.schema)
    print(pq_file.num_row_groups)
    print(pq.read_table(filename,
                        columns=['col1','col2','col3','col4','col5']).to_pandas())
{code}

Results

{code}

  created_by: parquet-cpp version 1.4.1-SNAPSHOT
  num_columns: 6
  num_rows: 1
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1434

col1: INT32
col2: INT32
col3: INT32 UNKNOWN
col4: INT32
col5: INT32
__index_level_0__: INT64

1
   col1  col2  col3  col4  col5
0     1     2  None     4     5
{code}


> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major

[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-07-28 Thread Ken Terada (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Terada updated PARQUET-1361:

Description: 
The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
values for INT type columns which causes unexpected parsing errors in 
downstream systems ingesting those files.

e.g.,
{{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
type}}

*+Reproduction Steps+*

OS: CentOS 7.5.1804
Python: 3.4.8

+Prerequisites:+
* Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
{{PyArrow: 0.9.0}}

+Step 1+

Generate the parquet file.

{{sample_w_null.csv}}

{code}
col1,col2,col3,col4,col5
1,2,,4,5
{code}

{{parquet-1361-repro-1.py}}

{code}
#!/usr/bin/python

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

input_file = 'sample_w_null.csv'
output_file = 'int_unknown.parquet'
p_schema = {'col1': np.int32,
            'col2': np.int32,
            'col3': np.unicode_,
            'col4': np.int32,
            'col5': np.int32}

df = pd.read_csv(input_file, dtype=p_schema)
table = pa.Table.from_pandas(df)
pq.write_table(table, output_file)
{code}

+Step 2+

Inspect the metadata of the generated file.

{{parquet-1361-repro-2.py}}

{code}
#!/usr/bin/python

import pyarrow.parquet as pq

for filename in ['int_unknown.parquet']:
    pq_file = pq.ParquetFile(filename)
    print(pq_file.metadata)
    print(pq_file.schema)
    print(pq_file.num_row_groups)
    print(pq.read_table(filename,
                        columns=['col1','col2','col3','col4','col5']).to_pandas())
{code}

Results

{code}

  created_by: parquet-cpp version 1.4.1-SNAPSHOT
  num_columns: 6
  num_rows: 1
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1434

col1: INT32
col2: INT32
col3: INT32 UNKNOWN
col4: INT32
col5: INT32
__index_level_0__: INT64

1
   col1  col2  col3  col4  col5
0     1     2  None     4     5
{code}

  was:
The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
values for INT type columns which causes unexpected parsing errors in 
downstream systems ingesting those files.

e.g.,
{{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
type}}

*+Reproduction Steps+*

OS: CentOS 7.5.1804
Python: 3.4.8

+Prerequisites:+
* Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
{{PyArrow: 0.9.0}}

+Step 1: +

Generate the parquet file.

{{sample_w_null.csv}}

{code}
col1,col2,col3,col4,col5
1,2,,4,5
{code}

{{parquet-1361-repro-1.py}}

{code}
#!/usr/bin/python

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

input_file = 'sample_w_null.csv'
output_file = 'int_unknown.parquet'
p_schema = {'col1': np.int32,
            'col2': np.int32,
            'col3': np.unicode_,
            'col4': np.int32,
            'col5': np.int32}

df = pd.read_csv(input_file, dtype=p_schema)
table = pa.Table.from_pandas(df)
pq.write_table(table, output_file)
{code}

+Step 2:+

Inspect the metadata of the generated file.

{{parquet-1361-repro-2.py}}

{code}
#!/usr/bin/python

import pyarrow.parquet as pq

for filename in ['int_unknown.parquet']:
    pq_file = pq.ParquetFile(filename)
    print(pq_file.metadata)
    print(pq_file.schema)
    print(pq_file.num_row_groups)
    print(pq.read_table(filename,
                        columns=['col1','col2','col3','col4','col5']).to_pandas())
{code}

Results

{code}

  created_by: parquet-cpp version 1.4.1-SNAPSHOT
  num_columns: 6
  num_rows: 1
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1434

col1: INT32
col2: INT32
col3: INT32 UNKNOWN
col4: INT32
col5: INT32
__index_level_0__: INT64

1
   col1  col2  col3  col4  col5
0     1     2  None     4     5
{code}


> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major

[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-07-28 Thread Ken Terada (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Terada updated PARQUET-1361:

Description: 
The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
values for INT type columns which causes unexpected parsing errors in 
downstream systems ingesting those files.

e.g.,
{{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
type}}

+Reproduction Steps+

OS: CentOS 7.5.1804
Python: 3.4.8

+Prerequisites:+
* Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
{{PyArrow: 0.9.0}}

+Step 1: +

Generate the parquet file.

{{sample_w_null.csv}}

{code}
col1,col2,col3,col4,col5
1,2,,4,5
{code}

{{parquet-1361-repro-1.py}}

{code}
#!/usr/bin/python

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

input_file = 'sample_w_null.csv'
output_file = 'int_unknown.parquet'
p_schema = {'col1': np.int32,
            'col2': np.int32,
            'col3': np.unicode_,
            'col4': np.int32,
            'col5': np.int32}

df = pd.read_csv(input_file, dtype=p_schema)
table = pa.Table.from_pandas(df)
pq.write_table(table, output_file)
{code}

+Step 2:+

Inspect the metadata of the generated file.

{{parquet-1361-repro-2.py}}

{code}
#!/usr/bin/python

import pyarrow.parquet as pq

for filename in ['int_unknown.parquet']:
    pq_file = pq.ParquetFile(filename)
    print(pq_file.metadata)
    print(pq_file.schema)
    print(pq_file.num_row_groups)
    print(pq.read_table(filename,
                        columns=['col1','col2','col3','col4','col5']).to_pandas())
{code}

Results

{code}

  created_by: parquet-cpp version 1.4.1-SNAPSHOT
  num_columns: 6
  num_rows: 1
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1434

col1: INT32
col2: INT32
col3: INT32 UNKNOWN
col4: INT32
col5: INT32
__index_level_0__: INT64

1
   col1  col2  col3  col4  col5
0     1     2  None     4     5
{code}
Summary: [C++] 1.4.1 library allows creation of parquet file w/NULL 
values for INT types  (was: [C++] 1.4.1 library allows creation of parquet file 
w/UNKNOWN INT data)

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0
> Reporter: Ken Terada
> Priority: Major