[jira] [Commented] (PARQUET-1430) [C++] Add tests for C++ tools

2022-02-08 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488845#comment-17488845
 ] 

Deepak Majeti commented on PARQUET-1430:


[~apitrou] sorry for the delayed reply. I am not planning to work on this 
anytime soon, so I unassigned myself.

> [C++] Add tests for C++ tools
> -
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> We currently do not have any tests for the tools.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (PARQUET-1430) [C++] Add tests for C++ tools

2022-02-08 Thread Deepak Majeti (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1430:
--

Assignee: (was: Deepak Majeti)

> [C++] Add tests for C++ tools
> -
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> We currently do not have any tests for the tools.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1848) Add Index support in the read path

2020-04-23 Thread Deepak Majeti (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1848:
---
Description: 
The scope of this Jira is to add support for reading indexes from a Parquet 
file.

The changes will involve de-serializing indexes and adding API to return them.

To test the implementation, we can get Parquet files with indexes generated via 
Impala or parquet-mr and see that the parquet-cpp tools can print them.

  was:
The scope of this Jira is to add support for reading indexes from a Parquet 
file.

The changes will involve de-serializing indexes and adding API to return them.
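As a rough illustration of the read path described above, a hypothetical API shape could look like the sketch below. The struct and method names are invented for this sketch; they are not the shipped parquet-cpp interface.

{code:cpp}
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the deserialized thrift index structures.
struct OffsetIndexEntry {
  int64_t page_offset;   // file offset of the page header
  int32_t page_size;     // compressed page size in bytes
  int64_t first_row;     // index of the first row in the page
};

struct ColumnIndexView {
  std::vector<bool> null_pages;         // true if a page is all-null
  std::vector<std::string> min_values;  // per-page min, plain-encoded
  std::vector<std::string> max_values;  // per-page max, plain-encoded
};

// Sketch of accessors a RowGroupReader could expose after de-serializing
// the indexes referenced from the column chunk metadata:
//   ColumnIndexView GetColumnIndex(int column);
//   std::vector<OffsetIndexEntry> GetOffsetIndex(int column);
{code}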


> Add Index support in the read path
> --
>
> Key: PARQUET-1848
> URL: https://issues.apache.org/jira/browse/PARQUET-1848
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>
> The scope of this Jira is to add support for reading indexes from a Parquet 
> file.
> The changes will involve de-serializing indexes and adding API to return them.
> To test the implementation, we can get Parquet files with indexes generated 
> via Impala or parquet-mr and see that the parquet-cpp tools can print them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1849) Add index support in the write path

2020-04-23 Thread Deepak Majeti (Jira)
Deepak Majeti created PARQUET-1849:
--

 Summary: Add index support in the write path
 Key: PARQUET-1849
 URL: https://issues.apache.org/jira/browse/PARQUET-1849
 Project: Parquet
  Issue Type: Sub-task
Reporter: Deepak Majeti
Assignee: Deepak Majeti


The scope of this Jira is to add index support in the write path.

The changes will involve computing the indexes followed by serializing them and 
writing to the file.
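A minimal sketch of that flow, assuming illustrative types (the builder below is not the shipped writer API): collect per-page entries while pages are flushed, then serialize them just before the footer.

{code:cpp}
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-page index entry accumulated during page flushes.
struct PageIndexEntry {
  std::string min;   // plain-encoded per-page minimum
  std::string max;   // plain-encoded per-page maximum
  bool null_page;    // page contains only nulls
};

class ColumnIndexBuilder {
 public:
  // Called by the column writer each time a data page is flushed.
  void AddPage(std::string min, std::string max, bool null_page) {
    pages_.push_back({std::move(min), std::move(max), null_page});
  }
  // A real writer would thrift-serialize the entries here and record the
  // resulting offset/length in the ColumnChunk metadata.
  const std::vector<PageIndexEntry>& Finish() const { return pages_; }

 private:
  std::vector<PageIndexEntry> pages_;
};
{code}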



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1848) Add Index support in the read path

2020-04-23 Thread Deepak Majeti (Jira)
Deepak Majeti created PARQUET-1848:
--

 Summary: Add Index support in the read path
 Key: PARQUET-1848
 URL: https://issues.apache.org/jira/browse/PARQUET-1848
 Project: Parquet
  Issue Type: Sub-task
Reporter: Deepak Majeti
Assignee: Deepak Majeti


The scope of this Jira is to add support for reading indexes from a Parquet 
file.

The changes will involve de-serializing indexes and adding API to return them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1404) [C++] Add index pages to the format to support efficient page skipping to parquet-cpp

2020-04-23 Thread Deepak Majeti (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1404:
---
Description: 
The scope of the Jira is to take advantage of indexes added as part of 
PARQUET-922

It is easier to implement this if we create sub-tasks

  was:
The scope of the Jira is to take advantage of indexes added as part of Parquet-

It is easier to implement this if we create sub-tasks


> [C++] Add index pages to the format to support efficient page skipping to 
> parquet-cpp
> -
>
> Key: PARQUET-1404
> URL: https://issues.apache.org/jira/browse/PARQUET-1404
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Renato Javier Marroquín Mogrovejo
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The scope of the Jira is to take advantage of indexes added as part of 
> PARQUET-922
> It is easier to implement this if we create sub-tasks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1404) [C++] Add index pages to the format to support efficient page skipping to parquet-cpp

2020-04-23 Thread Deepak Majeti (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1404:
---
Description: 
The scope of the Jira is to take advantage of indexes added as part of Parquet-

It is easier to implement this if we create sub-tasks

  was:Once PARQUET-922 is completed we can port such implementation to 
parquet-cpp as well.


> [C++] Add index pages to the format to support efficient page skipping to 
> parquet-cpp
> -
>
> Key: PARQUET-1404
> URL: https://issues.apache.org/jira/browse/PARQUET-1404
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Renato Javier Marroquín Mogrovejo
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The scope of the Jira is to take advantage of indexes added as part of 
> Parquet-
> It is easier to implement this if we create sub-tasks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1781) [C++] 1.4.0+ reader ignore stats created by 1.3.* writer

2020-02-04 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030160#comment-17030160
 ] 

Deepak Majeti commented on PARQUET-1781:


Even though the 1.3 writer wrote the "min_value"/"max_value" fields along with the 
old "min"/"max", the new statistics are not valid because the column order is 
not set according to the Parquet spec. In a way, this was a bug in the 1.3 
reader: it returned the new stats without verifying the column order. The 
reader in 1.4 does the right thing.
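A minimal sketch of the version-aware check discussed here and proposed in the report below. The enum and flags are stand-ins for the thrift metadata, and the writer-version flag is a hypothetical input, not a shipped helper.

{code:cpp}
// Stand-ins for the thrift-level metadata referenced in this issue.
enum class ColumnOrder { UNDEFINED, TYPE_DEFINED_ORDER };
struct StatsFlags { bool has_min_value; bool has_max_value; };

bool UseNewStats(ColumnOrder order, const StatsFlags& s,
                 bool writer_is_parquet_cpp_13x) {
  // The 1.4+ rule: only trust min_value/max_value when the writer declared
  // a type-defined column order.
  if (order == ColumnOrder::TYPE_DEFINED_ORDER) return true;
  // Backward-compat carve-out suggested in the report: parquet-cpp 1.3.x
  // wrote correct new stats but never set the column order flag.
  return writer_is_parquet_cpp_13x && (s.has_min_value || s.has_max_value);
}
{code}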

> [C++] 1.4.0+ reader ignore stats created by 1.3.* writer
> 
>
> Key: PARQUET-1781
> URL: https://issues.apache.org/jira/browse/PARQUET-1781
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0, cpp-1.5.0
>Reporter: Milos Sukovic
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> [https://github.com/apache/arrow/commit/d257a88ed612301c0411894dfa783fcbff1bc867]
> In the referenced commit, a change to metadata.cc altered the check for 
> whether the new stats (min_value/max_value) are used.
> From
> if (metadata.statistics.__isset.max_value || 
> metadata.statistics.__isset.min_value)
> to
> if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER)
>  
> This change breaks backward compatibility: all files that contain the new 
> stats (min_value/max_value) and were created before this change are valid, 
> but they do not set the column order flag.
> After this change, those stats are ignored because the column order flag is 
> checked.
> Possible fix would be something like:
> if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER || 
> (version == parquetcpp 1.3.* && (metadata.statistics.__isset.max_value || 
> metadata.statistics.__isset.min_value)))
> I checked parquet-mr, and it seems columnOrder was introduced there as part 
> of the same change as min_value and max_value, so the issue shouldn't happen 
> for files created by Java code. But its reader probably also ignores stats 
> for files created by parquet-cpp 1.3.*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014666#comment-17014666
 ] 

Deepak Majeti commented on PARQUET-1698:


How about adding an API to the _RowGroupReader_ that returns the _col_start_ 
and _col_length_ for each column? parquet-cpp clients could then pass an 
InputStream directly for each _ColumnReader_.
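A sketch of the shape such an API could take; the range struct and helper below are invented for illustration and are not the shipped parquet-cpp interface.

{code:cpp}
#include <cstdint>
#include <memory>
#include <arrow/buffer.h>
#include <arrow/io/memory.h>

// Hypothetical byte range of one column chunk, as it might be returned by a
// RowGroupReader::GetColumnRange(int column) accessor.
struct ColumnChunkRange {
  int64_t col_start;    // file offset of the chunk's first page
  int64_t col_length;   // total serialized bytes of the chunk
};

// Given such a range and the pre-buffered file bytes, a client could wrap
// the slice in an in-memory stream and hand it to a ColumnReader.
std::shared_ptr<arrow::io::BufferReader> PreBufferChunk(
    const std::shared_ptr<arrow::Buffer>& file_bytes, ColumnChunkRange r) {
  return std::make_shared<arrow::io::BufferReader>(
      arrow::SliceBuffer(file_bytes, r.col_start, r.col_length));
}
{code}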

 

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory and then 
> deserialize the constituent columns from this buffer.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014666#comment-17014666
 ] 

Deepak Majeti edited comment on PARQUET-1698 at 1/13/20 9:36 PM:
-

How about adding an API to the _RowGroupReader_ that returns the _col_start_ 
and _col_length_ for each column chunk? parquet-cpp clients could then pass an 
InputStream directly for each _ColumnReader_.

 


was (Author: mdeepak):
How about adding an API to the _RowGroupReader_ that returns the _col_start_ 
and _col_length_ for each column? parquet-cpp clients could then pass an 
InputStream directly for each _ColumnReader_.

 

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory and then 
> deserialize the constituent columns from this buffer.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2019-12-29 Thread Deepak Majeti (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1698:
--

Assignee: Zherui Cao

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory and then 
> deserialize the constituent columns from this buffer.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1663) Provide API to check the presence of complex data types

2019-09-23 Thread Deepak Majeti (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1663:
---
Summary: Provide API to check the presence of complex data types  (was: 
Provide API to indicate if complex data type exist.)

> Provide API to check the presence of complex data types
> ---
>
> Key: PARQUET-1663
> URL: https://issues.apache.org/jira/browse/PARQUET-1663
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Major
>
> we need functions like
> containsMap()
> containsArray()
> containsStruct()
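A hedged sketch of how such helpers might walk the parquet-cpp schema tree. This checks ConvertedType only (MAP/LIST); a full implementation would also look at the newer LogicalType annotations, and the struct case is only outlined in a comment.

{code:cpp}
#include <memory>
#include <parquet/schema.h>

static bool ContainsConvertedType(const parquet::schema::NodePtr& node,
                                  parquet::ConvertedType::type what) {
  if (node->converted_type() == what) return true;
  if (node->is_group()) {
    auto group = std::static_pointer_cast<parquet::schema::GroupNode>(node);
    for (int i = 0; i < group->field_count(); ++i) {
      if (ContainsConvertedType(group->field(i), what)) return true;
    }
  }
  return false;
}

bool ContainsMap(const parquet::schema::NodePtr& root) {
  return ContainsConvertedType(root, parquet::ConvertedType::MAP);
}

bool ContainsArray(const parquet::schema::NodePtr& root) {
  return ContainsConvertedType(root, parquet::ConvertedType::LIST);
}
// containsStruct() would instead look for a GroupNode that carries neither
// a LIST nor a MAP annotation.
{code}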



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1655) [C++] Decimal comparisons used for min/max statistics are not correct

2019-09-18 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932644#comment-16932644
 ] 

Deepak Majeti commented on PARQUET-1655:


The Decimal values comparator indeed needs to be fixed. The Java code (below) 
has the decimal comparator and additionally handles decimals of different 
lengths via padding. (I don't see the MSB flip, but I know it is one way to 
compare two's-complement binary values.)

[https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java#L230]

[https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java#L379]
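For reference, a minimal sketch of the fixed-width comparison rule quoted in the issue below: flip the sign bit of the first byte, then compare unsigned byte-wise. Padding for decimals of unequal widths, as parquet-mr does, is left out.

{code:cpp}
#include <cstdint>
#include <cstring>
#include <string>

// Compares two big-endian two's-complement decimals of equal width.
bool DecimalLess(const std::string& a, const std::string& b) {
  // assert(a.size() == b.size() && !a.empty());
  uint8_t a0 = static_cast<uint8_t>(a[0]) ^ 0x80;  // flip the sign (MSB) bit
  uint8_t b0 = static_cast<uint8_t>(b[0]) ^ 0x80;
  if (a0 != b0) return a0 < b0;
  // memcmp compares the remaining bytes as unsigned values.
  return std::memcmp(a.data() + 1, b.data() + 1, a.size() - 1) < 0;
}
{code}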

 

 

 

> [C++] Decimal comparisons used for min/max statistics are not correct
> -
>
> Key: PARQUET-1655
> URL: https://issues.apache.org/jira/browse/PARQUET-1655
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Philip Felton
>Priority: Major
>
> The [Parquet Format 
> specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
>  says
> bq. If the column uses int32 or int64 physical types, then signed comparison 
> of the integer values produces the correct ordering. If the physical type is 
> fixed, then the correct ordering can be produced by flipping the 
> most-significant bit in the first byte and then using unsigned byte-wise 
> comparison.
> However this isn't followed in the C++ Parquet code. 16-byte decimal 
> comparison is implemented using a lexicographical comparison of signed chars.
> This appears to be because the function 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
>  just goes off the sort_order (signed) and physical_type 
> (FIXED_LEN_BYTE_ARRAY); there is no override for decimal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1636) [C++] Incompatibility due to moving from Parquet to Arrow IO interfaces

2019-08-08 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903052#comment-16903052
 ] 

Deepak Majeti commented on PARQUET-1636:


I might be missing something here. Peek() moving the raw offset might not be 
the core issue. The old ReadAt changes the offset too. [~czxrrr] can you 
clarify this? Thanks.

> [C++] Incompatibility due to moving from Parquet to Arrow IO interfaces
> ---
>
> Key: PARQUET-1636
> URL: https://issues.apache.org/jira/browse/PARQUET-1636
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Wes McKinney
>Priority: Major
>
> We moved to the Arrow IO interfaces as part of 
> https://issues.apache.org/jira/browse/PARQUET-1422
> However, the BufferedInputStream implementations between Parquet and Arrow 
> are different.
> Parquet's BufferedInputStream used to take a RandomAccessSource. Arrow's 
> implementation takes an InputStream. As a result, the 
> {{::arrow::io::BufferedInputStream::Peek()}} implementation (which invokes 
> {{Read()}}) causes the raw source (the input to {{BufferedInputStream}}) to 
> change its offset on Peek(). This did not happen in Parquet's 
> BufferedInputStream implementation.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (PARQUET-1636) [C++] Incompatibility due to moving from Parquet to Arrow IO interfaces

2019-08-08 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903040#comment-16903040
 ] 

Deepak Majeti commented on PARQUET-1636:


CC: [~czxrrr]

> [C++] Incompatibility due to moving from Parquet to Arrow IO interfaces
> ---
>
> Key: PARQUET-1636
> URL: https://issues.apache.org/jira/browse/PARQUET-1636
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Wes McKinney
>Priority: Major
>
> We moved to the Arrow IO interfaces as part of 
> https://issues.apache.org/jira/browse/PARQUET-1422
> However, the BufferedInputStream implementations between Parquet and Arrow 
> are different.
> Parquet's BufferedInputStream used to take a RandomAccessSource. Arrow's 
> implementation takes an InputStream. As a result, the 
> {{::arrow::io::BufferedInputStream::Peek()}} implementation (which invokes 
> {{Read()}}) causes the raw source (the input to {{BufferedInputStream}}) to 
> change its offset on Peek(). This did not happen in Parquet's 
> BufferedInputStream implementation.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (PARQUET-1636) [C++] Incompatibility due to moving from Parquet to Arrow IO interfaces

2019-08-08 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1636:
--

 Summary: [C++] Incompatibility due to moving from Parquet to Arrow 
IO interfaces
 Key: PARQUET-1636
 URL: https://issues.apache.org/jira/browse/PARQUET-1636
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Wes McKinney


We moved to the Arrow IO interfaces as part of 
https://issues.apache.org/jira/browse/PARQUET-1422

However, the BufferedInputStream implementations between Parquet and Arrow are 
different.

Parquet's BufferedInputStream used to take a RandomAccessSource. Arrow's 
implementation takes an InputStream. As a result, the 
{{::arrow::io::BufferedInputStream::Peek()}} implementation (which invokes 
{{Read()}}) causes the raw source (the input to {{BufferedInputStream}}) to 
change its offset on Peek(). This did not happen in Parquet's 
BufferedInputStream implementation.
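A generic illustration of the difference, using plain structs rather than the Arrow or Parquet classes: a buffered wrapper over a forward-only stream must consume bytes from the raw source to satisfy Peek(), so the raw offset advances as a side effect.

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

struct RawInputStream {            // forward-only source, like an InputStream
  std::string data;
  int64_t offset = 0;
  int64_t Read(int64_t n, char* out) {
    int64_t got = std::min<int64_t>(n, (int64_t)data.size() - offset);
    std::memcpy(out, data.data() + offset, got);
    offset += got;                 // every read moves the cursor
    return got;
  }
};

struct BufferedStream {
  RawInputStream* raw;
  std::string buffer;
  std::string Peek(int64_t n) {    // must pull from raw to fill the buffer,
    char tmp[64];                  // so raw->offset advances on Peek()
    while ((int64_t)buffer.size() < n) {
      int64_t got =
          raw->Read(std::min<int64_t>(n - (int64_t)buffer.size(), 64), tmp);
      if (got == 0) break;
      buffer.append(tmp, got);
    }
    return buffer.substr(0, (size_t)n);
  }
  // A wrapper over a random-access source could instead ReadAt() an absolute
  // position and leave the raw cursor untouched.
};
{code}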



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (PARQUET-1626) [C++] Ability to concat parquet files

2019-07-17 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887447#comment-16887447
 ] 

Deepak Majeti commented on PARQUET-1626:


The simplest approach is to read the two files and export them back as a single 
file. You can follow the existing reader-writer.cc example to do this.

If you want to optimize by avoiding the compression/decompression of the Data 
Pages, then you have to carefully update the metadata (counts, stats, offsets, 
lengths, etc.) at the File and ColumnChunk levels and append the individual 
RowGroups. If you further want to append two RowGroups together, you also have 
to update the RowGroup metadata.

On top of this, you have to ensure the two files being merged are compatible. 
In your case, this won't be a problem since you have the same writer generating 
the parquet files with the same schema. But in general, if the two files are 
generated from different writers, you cannot easily merge them.
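A hedged sketch of the simple read-and-rewrite approach, using the Arrow-based parquet API and assuming a reasonably recent Arrow; error handling is minimal.

{code:cpp}
#include <memory>
#include <string>
#include <vector>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

arrow::Status ConcatParquet(const std::string& in1, const std::string& in2,
                            const std::string& out) {
  auto* pool = arrow::default_memory_pool();
  std::vector<std::shared_ptr<arrow::Table>> tables;
  for (const auto& path : {in1, in2}) {
    ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(infile, pool, &reader));
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
    tables.push_back(std::move(table));
  }
  // Only valid when both files share the same schema, as noted above.
  ARROW_ASSIGN_OR_RAISE(auto merged, arrow::ConcatenateTables(tables));
  ARROW_ASSIGN_OR_RAISE(auto outfile, arrow::io::FileOutputStream::Open(out));
  return parquet::arrow::WriteTable(*merged, pool, outfile,
                                    /*chunk_size=*/1 << 20);
}
{code}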

> [C++] Ability to concat parquet files 
> --
>
> Key: PARQUET-1626
> URL: https://issues.apache.org/jira/browse/PARQUET-1626
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.1
>Reporter: nileema shingte
>Priority: Major
>  Labels: features
>
> Ability to concat the parquet files is something we've wanted for some time 
> too. When we generate parquet files partitioned by an expression, we often 
> end up with tiny files and would like to add a post-processing step to concat 
> these files together.
> Is there a plan to add this ability to the library any time soon? 
> If not, it would be great if someone can provide a somewhat detailed 
> pseudocode (expanding on what [~xhochy] mentioned in the comment in 
> PARQUET-1022) as a guideline for conditions/scenarios that need to be handled 
> with extra care, so we can contribute this as a PR. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (PARQUET-1603) [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1603:
--

Assignee: Deepak Majeti

> [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType
> -
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1603) [C++] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1603:
---
Summary: [C++] rename parquet::LogicalType to parquet::ConvertedType  (was: 
[C++][Rust] rename parquet::LogicalType to parquet::ConvertedType)

> [C++] rename parquet::LogicalType to parquet::ConvertedType
> ---
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1603) [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1603:
---
Summary: [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType  
(was: [C++][R] rename parquet::LogicalType to parquet::ConvertedType)

> [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType
> -
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1603) [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868028#comment-16868028
 ] 

Deepak Majeti commented on PARQUET-1603:


Yes! I will be happy to finish this if you are okay with that.

> [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType
> -
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1603) [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1603:
--

Assignee: (was: Deepak Majeti)

> [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType
> -
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1603) [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868024#comment-16868024
 ] 

Deepak Majeti commented on PARQUET-1603:


[~jorisvandenbossche] did you plan to work on this?

> [C++][Rust] rename parquet::LogicalType to parquet::ConvertedType
> -
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1603) [C++] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1603:
---
Labels: c++  (was: C)

> [C++] rename parquet::LogicalType to parquet::ConvertedType
> ---
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1603) [C++] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1603:
--

Assignee: Deepak Majeti

> [C++] rename parquet::LogicalType to parquet::ConvertedType
> ---
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1603) [C++][R] rename parquet::LogicalType to parquet::ConvertedType

2019-06-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1603:
---
Summary: [C++][R] rename parquet::LogicalType to parquet::ConvertedType  
(was: [C++] rename parquet::LogicalType to parquet::ConvertedType)

> [C++][R] rename parquet::LogicalType to parquet::ConvertedType
> --
>
> Key: PARQUET-1603
> URL: https://issues.apache.org/jira/browse/PARQUET-1603
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: c++
>
> From discussion on the mailing list 
> (https://lists.apache.org/thread.html/5cbbd351b17aed50f40df286bd2f080cb6e5e9b23e5a5c79b7e6e041@%3Cdev.parquet.apache.org%3E),
>  the idea is to rename parquet-cpp's current {{LogicalType}} to 
> {{ConvertedType}}, and the new {{LogicalAnnotation}} to {{LogicalType}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-770) [C++] Implement PARQUET-686 statistics bug fixes

2019-06-01 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-770.
---
   Resolution: Fixed
Fix Version/s: cpp-1.5.0

> [C++] Implement PARQUET-686 statistics bug fixes
> 
>
> Key: PARQUET-770
> URL: https://issues.apache.org/jira/browse/PARQUET-770
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> The statistics written by parquet-mr / parquet-cpp could be incorrect for 
> certain data types.
> parquet-mr JIRA: PARQUET-686
> Issue Discussion: https://github.com/apache/parquet-mr/pull/362
> parquet-mr patch: https://github.com/apache/parquet-mr/pull/367



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-770) [C++] Implement PARQUET-686 statistics bug fixes

2019-06-01 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853859#comment-16853859
 ] 

Deepak Majeti commented on PARQUET-770:
---

[~wesmckinn] this should be resolved now.

> [C++] Implement PARQUET-686 statistics bug fixes
> 
>
> Key: PARQUET-770
> URL: https://issues.apache.org/jira/browse/PARQUET-770
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>
> The statistics written by parquet-mr / parquet-cpp could be incorrect for 
> certain data types.
> parquet-mr JIRA: PARQUET-686
> Issue Discussion: https://github.com/apache/parquet-mr/pull/362
> parquet-mr patch: https://github.com/apache/parquet-mr/pull/367



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1404) [C++] Add index pages to the format to support efficient page skipping to parquet-cpp

2019-05-21 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1404:
--

Assignee: Deepak Majeti

> [C++] Add index pages to the format to support efficient page skipping to 
> parquet-cpp
> -
>
> Key: PARQUET-1404
> URL: https://issues.apache.org/jira/browse/PARQUET-1404
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Renato Javier Marroquín Mogrovejo
>Assignee: Deepak Majeti
>Priority: Major
>
> Once PARQUET-922 is completed we can port such implementation to parquet-cpp 
> as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-517) [C++] Use arrow::MemoryPool for all heap allocations

2019-05-20 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-517:
-

Assignee: Anatoli Shein  (was: Deepak Majeti)

> [C++] Use arrow::MemoryPool for all heap allocations
> 
>
> Key: PARQUET-517
> URL: https://issues.apache.org/jira/browse/PARQUET-517
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Anatoli Shein
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> We are using {{std::vector}} in many places for memory allocation; if we want 
> to use SSE on this memory we may run into some problems.
> A couple of things we should do:
> * Add an STL allocator for {{std::vector}} that ensures 16-byte aligned memory
> * Check user-provided memory for alignment before utilizing an 
> SSE-accelerated routine (e.g. SSE hash functions for dictionary encoding) and 
> decide whether to copy and use SSE, or skip the copy and use the non-SSE code 
> path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-517) [C++] Use arrow::MemoryPool for all heap allocations

2019-05-20 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844020#comment-16844020
 ] 

Deepak Majeti commented on PARQUET-517:
---

The {{std::vector buffered_indices_}} inside {{DictEncoder}} can benefit from 
using an {{arrow::MemoryPool}}. This is significant when there are many columns.
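A minimal sketch of what that could look like with Arrow's STL allocator (assuming {{arrow/stl_allocator.h}} from a recent Arrow), so the vector's allocations are tracked by a MemoryPool. The variable name mirrors {{buffered_indices_}} but the refactor itself is hypothetical.

{code:cpp}
#include <cstdint>
#include <vector>
#include <arrow/memory_pool.h>
#include <arrow/stl_allocator.h>

using PoolAllocator = arrow::stl::allocator<int32_t>;
using PoolVector = std::vector<int32_t, PoolAllocator>;

int main() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  // Hypothetical replacement for DictEncoder's buffered_indices_: the
  // vector's heap usage is now accounted to (and tunable via) the pool.
  PoolVector buffered_indices{PoolAllocator(pool)};
  buffered_indices.reserve(1024);
  return pool->bytes_allocated() > 0 ? 0 : 1;
}
{code}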

> [C++] Use arrow::MemoryPool for all heap allocations
> 
>
> Key: PARQUET-517
> URL: https://issues.apache.org/jira/browse/PARQUET-517
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> We are using {{std::vector}} in many places for memory allocation; if we want 
> to use SSE on this memory we may run into some problems.
> A couple of things we should do:
> * Add an STL allocator for {{std::vector}} that ensures 16-byte aligned memory
> * Check user-provided memory for alignment before utilizing an 
> SSE-accelerated routine (e.g. SSE hash functions for dictionary encoding) and 
> decide whether to copy and use SSE, or skip the copy and use the non-SSE code 
> path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-517) [C++] Use arrow::MemoryPool for all heap allocations

2019-05-20 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-517:
-

Assignee: Deepak Majeti  (was: Wes McKinney)

> [C++] Use arrow::MemoryPool for all heap allocations
> 
>
> Key: PARQUET-517
> URL: https://issues.apache.org/jira/browse/PARQUET-517
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> We are using {{std::vector}} in many places for memory allocation; if we want 
> to use SSE on this memory we may run into some problems.
> A couple of things we should do:
> * Add an STL allocator for {{std::vector}} that ensures 16-byte aligned memory
> * Check user-provided memory for alignment before utilizing an 
> SSE-accelerated routine (e.g. SSE hash functions for dictionary encoding) and 
> decide whether to copy and use SSE, or skip the copy and use the non-SSE code 
> path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829769#comment-16829769
 ] 

Deepak Majeti commented on PARQUET-1405:


Filed https://issues.apache.org/jira/browse/ARROW-5241

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --
>
> Key: PARQUET-1405
> URL: https://issues.apache.org/jira/browse/PARQUET-1405
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
>Reporter: Jeremy Heffner
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.6.0
>
> Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns 
> (utf8 strings).  In particular, we were generating WKT representations of 
> polygons that contained ~34 million characters when we ran into the issue. 
> The attached example generates a dataframe with one record and one column 
> containing a random string with 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but 
> fails upon reading the file:
> {code:java}
> ---
> ArrowIOError Traceback (most recent call last)
>  in ()
> > 1 df_read_in = pd.read_parquet('test.parquet')
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read_parquet(path, engine, columns, **kwargs)
> 286 
> 287 impl = get_engine(engine)
> --> 288 return impl.read(path, columns=columns, **kwargs)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read(self, path, columns, **kwargs)
> 129 kwargs['use_pandas_metadata'] = True
> 130 result = self.api.parquet.read_table(path, columns=columns,
> --> 131 **kwargs).to_pandas()
> 132 if should_close:
> 133 try:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 1044 fs = _get_fs_from_path(source)
> 1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046 use_pandas_metadata=use_pandas_metadata)
> 1047 
> 1048 pf = ParquetFile(source, metadata=metadata)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in 
> read_parquet(self, path, columns, metadata, schema, nthreads, 
> use_pandas_metadata)
> 175 filesystem=self)
> 176 return dataset.read(columns=columns, nthreads=nthreads,
> --> 177 use_pandas_metadata=use_pandas_metadata)
> 178 
> 179 def open(self, path, mode='rb'):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 896 partitions=self.partitions,
> 897 open_file_func=open_file,
> --> 898 use_pandas_metadata=use_pandas_metadata)
> 899 tables.append(table)
> 900 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, partitions, open_file_func, file, 
> use_pandas_metadata)
> 459 table = reader.read_row_group(self.row_group, **options)
> 460 else:
> --> 461 table = reader.read(**options)
> 462 
> 463 if len(self.partition_keys) > 0:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 150 columns, use_pandas_metadata=use_pandas_metadata)
> 151 return self.reader.read_all(column_indices=column_indices,
> --> 152 nthreads=nthreads)
> 153 
> 154 def scan_contents(self, columns=None, batch_size=65536):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829711#comment-16829711
 ] 

Deepak Majeti edited comment on PARQUET-1405 at 4/29/19 8:59 PM:
-

PARQUET-979 omitted large statistics inside ColumnMetaData but missed omitting 
them inside the DataPageHeader. I will fix this.
Disabling statistics when writing is a workaround, but I don't see any option 
to disable statistics in the Python API.
[~wesmckinn] or [~xhochy] must correct me here.


was (Author: mdeepak):
PARQUET-979 omitted large statistics inside ColumnMetaData but missed omitting 
them inside the DataPageHeader. I will fix this.
Disabling statistics is a workaround, but I don't see any option to disable 
statistics in the Python API.
[~wesmckinn] or [~xhochy] must correct me here.

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --
>
> Key: PARQUET-1405
> URL: https://issues.apache.org/jira/browse/PARQUET-1405
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
>Reporter: Jeremy Heffner
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.6.0
>
> Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns 
> (utf8 strings).  In particular, we were generating WKT representations of 
> polygons that contained ~34 million characters when we ran into the issue. 
> The attached example generates a dataframe with one record and one column 
> containing a random string with 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but 
> fails upon reading the file:
> {code:java}
> ---
> ArrowIOError Traceback (most recent call last)
>  in ()
> > 1 df_read_in = pd.read_parquet('test.parquet')
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read_parquet(path, engine, columns, **kwargs)
> 286 
> 287 impl = get_engine(engine)
> --> 288 return impl.read(path, columns=columns, **kwargs)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read(self, path, columns, **kwargs)
> 129 kwargs['use_pandas_metadata'] = True
> 130 result = self.api.parquet.read_table(path, columns=columns,
> --> 131 **kwargs).to_pandas()
> 132 if should_close:
> 133 try:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 1044 fs = _get_fs_from_path(source)
> 1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046 use_pandas_metadata=use_pandas_metadata)
> 1047 
> 1048 pf = ParquetFile(source, metadata=metadata)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in 
> read_parquet(self, path, columns, metadata, schema, nthreads, 
> use_pandas_metadata)
> 175 filesystem=self)
> 176 return dataset.read(columns=columns, nthreads=nthreads,
> --> 177 use_pandas_metadata=use_pandas_metadata)
> 178 
> 179 def open(self, path, mode='rb'):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 896 partitions=self.partitions,
> 897 open_file_func=open_file,
> --> 898 use_pandas_metadata=use_pandas_metadata)
> 899 tables.append(table)
> 900 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, partitions, open_file_func, file, 
> use_pandas_metadata)
> 459 table = reader.read_row_group(self.row_group, **options)
> 460 else:
> --> 461 table = reader.read(**options)
> 462 
> 463 if len(self.partition_keys) > 0:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 150 columns, use_pandas_metadata=use_pandas_metadata)
> 151 return self.reader.read_all(column_indices=column_indices,
> --> 152 nthreads=nthreads)
> 153 
> 154 def scan_contents(self, columns=None, batch_size=65536):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829711#comment-16829711
 ] 

Deepak Majeti commented on PARQUET-1405:


PARQUET-979 omitted large statistics inside ColumnMetaData but missed omitting 
them inside the DataPageHeader. I will fix this.
Disabling statistics is a workaround, but I don't see any option to disable 
statistics in the Python API.
[~wesmckinn] or [~xhochy] must correct me here.
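A minimal sketch of the kind of size cap involved; the threshold and names here are illustrative assumptions, not the exact parquet-cpp constants. Statistics beyond the cap would simply be left out of both the ColumnMetaData and the page header.

{code:cpp}
#include <cstdint>
#include <string>

// Assumed cap for illustration; parquet-cpp uses a similar configurable limit.
constexpr int64_t kMaxStatisticsSize = 4096;

bool ShouldWriteStats(const std::string& min, const std::string& max) {
  // Omit min/max entirely when the encoded values are too large, instead of
  // emitting oversized headers like the ones this issue trips over.
  return static_cast<int64_t>(min.size() + max.size()) <= kMaxStatisticsSize;
}
{code}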

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --
>
> Key: PARQUET-1405
> URL: https://issues.apache.org/jira/browse/PARQUET-1405
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
>Reporter: Jeremy Heffner
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.6.0
>
> Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns 
> (utf8 strings).  In particular, we were generating WKT representations of 
> polygons that contained ~34 million characters when we ran into the issue. 
> The attached example generates a dataframe with one record and one column 
> containing a random string with 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but 
> fails upon reading the file:
> {code:java}
> ---
> ArrowIOError Traceback (most recent call last)
>  in ()
> > 1 df_read_in = pd.read_parquet('test.parquet')
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read_parquet(path, engine, columns, **kwargs)
> 286 
> 287 impl = get_engine(engine)
> --> 288 return impl.read(path, columns=columns, **kwargs)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read(self, path, columns, **kwargs)
> 129 kwargs['use_pandas_metadata'] = True
> 130 result = self.api.parquet.read_table(path, columns=columns,
> --> 131 **kwargs).to_pandas()
> 132 if should_close:
> 133 try:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 1044 fs = _get_fs_from_path(source)
> 1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046 use_pandas_metadata=use_pandas_metadata)
> 1047 
> 1048 pf = ParquetFile(source, metadata=metadata)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in 
> read_parquet(self, path, columns, metadata, schema, nthreads, 
> use_pandas_metadata)
> 175 filesystem=self)
> 176 return dataset.read(columns=columns, nthreads=nthreads,
> --> 177 use_pandas_metadata=use_pandas_metadata)
> 178 
> 179 def open(self, path, mode='rb'):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 896 partitions=self.partitions,
> 897 open_file_func=open_file,
> --> 898 use_pandas_metadata=use_pandas_metadata)
> 899 tables.append(table)
> 900 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, partitions, open_file_func, file, 
> use_pandas_metadata)
> 459 table = reader.read_row_group(self.row_group, **options)
> 460 else:
> --> 461 table = reader.read(**options)
> 462 
> 463 if len(self.partition_keys) > 0:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 150 columns, use_pandas_metadata=use_pandas_metadata)
> 151 return self.reader.read_all(column_indices=column_indices,
> --> 152 nthreads=nthreads)
> 153 
> 154 def scan_contents(self, columns=None, batch_size=65536):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1405:
--

Assignee: Deepak Majeti

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --
>
> Key: PARQUET-1405
> URL: https://issues.apache.org/jira/browse/PARQUET-1405
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
>Reporter: Jeremy Heffner
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.6.0
>
> Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns 
> (utf8 strings).  In particular, we were generating WKT representations of 
> polygons that contained ~34 million characters when we ran into the issue. 
> The attached example generates a dataframe with one record and one column 
> containing a random string with 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but 
> fails upon reading the file:
> {code:java}
> ---
> ArrowIOError Traceback (most recent call last)
>  in ()
> > 1 df_read_in = pd.read_parquet('test.parquet')
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read_parquet(path, engine, columns, **kwargs)
> 286 
> 287 impl = get_engine(engine)
> --> 288 return impl.read(path, columns=columns, **kwargs)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read(self, path, columns, **kwargs)
> 129 kwargs['use_pandas_metadata'] = True
> 130 result = self.api.parquet.read_table(path, columns=columns,
> --> 131 **kwargs).to_pandas()
> 132 if should_close:
> 133 try:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 1044 fs = _get_fs_from_path(source)
> 1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046 use_pandas_metadata=use_pandas_metadata)
> 1047 
> 1048 pf = ParquetFile(source, metadata=metadata)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in 
> read_parquet(self, path, columns, metadata, schema, nthreads, 
> use_pandas_metadata)
> 175 filesystem=self)
> 176 return dataset.read(columns=columns, nthreads=nthreads,
> --> 177 use_pandas_metadata=use_pandas_metadata)
> 178 
> 179 def open(self, path, mode='rb'):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 896 partitions=self.partitions,
> 897 open_file_func=open_file,
> --> 898 use_pandas_metadata=use_pandas_metadata)
> 899 tables.append(table)
> 900 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, partitions, open_file_func, file, 
> use_pandas_metadata)
> 459 table = reader.read_row_group(self.row_group, **options)
> 460 else:
> --> 461 table = reader.read(**options)
> 462 
> 463 if len(self.partition_keys) > 0:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 150 columns, use_pandas_metadata=use_pandas_metadata)
> 151 return self.reader.read_all(column_indices=column_indices,
> --> 152 nthreads=nthreads)
> 153 
> 154 def scan_contents(self, columns=None, batch_size=65536):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface

2019-02-05 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761079#comment-16761079
 ] 

Deepak Majeti commented on PARQUET-1523:


Sure!

> [C++] Vectorize comparator interface
> 
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the 
> innermost loop. In addition to removing the usage of 
> {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to 
> refactor to a vector-based comparison to update the minimum and maximum 
> elements in a single virtual call
> cc [~mdeepak] [~xhochy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface

2019-02-05 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760747#comment-16760747
 ] 

Deepak Majeti commented on PARQUET-1523:


I can work on this. I need to make some changes to the statistics API as well.

> [C++] Vectorize comparator interface
> 
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the 
> innermost loop. In addition to removing the usage of 
> {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to 
> refactor to a vector-based comparison to update the minimum and maximum 
> elements in a single virtual call
> cc [~mdeepak] [~xhochy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1118) Build a corpus of Parquet files that client implementations can use for validation

2019-01-28 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754027#comment-16754027
 ] 

Deepak Majeti commented on PARQUET-1118:


The Parquet format is being extended with many new features such as indexes, 
correct statistics, etc. Having compatibility across various writers 
(parquet-mr, parquet-cpp, Impala, etc.) is very important for the community to 
trust/depend on the Parquet file format. We should discuss this Jira in our 
next sync and start working towards improving the compatibility.

> Build a corpus of Parquet files that client implementations can use for 
> validation
> --
>
> Key: PARQUET-1118
> URL: https://issues.apache.org/jira/browse/PARQUET-1118
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Lars Volker
>Priority: Major
>
> We should build a corpus of Parquet files that client implementations can use 
> for validation. In addition to the input files, it should contain a 
> description or a verbatim copy of the data in each file, so that readers can 
> validate their results.
> As a starting point we can look at [the old parquet-compatibility 
> repo|https://github.com/Parquet/parquet-compatibility] and [Impala's test 
> data, in particular the Parquet files it 
> contains|https://github.com/apache/incubator-impala/tree/master/testdata].
> {noformat}
> $ find testdata | grep -i parq
> testdata/workloads/tpch/queries/insert_parquet.test
> testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
> testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
> testdata/workloads/functional-query/queries/QueryTest/insert_parquet_invalid_codec.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
> testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test
> testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet.test
> testdata/workloads/functional-query/queries/QueryTest/parquet.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
> testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-nested.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
> testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
> testdata/max_nesting_depth/int_map/file.parq
> testdata/max_nesting_depth/struct/file.parq
> testdata/max_nesting_depth/struct_map/file.parq
> testdata/max_nesting_depth/int_array/file.parq
> testdata/max_nesting_depth/struct_array/file.parq
> testdata/parquet_nested_types_encodings
> testdata/parquet_nested_types_encodings/README
> testdata/parquet_nested_types_encodings/UnannotatedListOfGroups.parquet
> testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
> testdata/parquet_nested_types_encodings/UnannotatedListOfPrimitives.parquet
> testdata/parquet_nested_types_encodings/AmbiguousList.json
> testdata/parquet_nested_types_encodings/AvroPrimitiveInList.parquet
> testdata/parquet_nested_types_encodings/ThriftPrimitiveInList.parquet
> testdata/parquet_nested_types_encodings/bad-avro.parquet
> testdata/parquet_nested_types_encodings/AmbiguousList.avsc
> testdata/parquet_nested_types_encodings/SingleFieldGroupInList.parquet
> testdata/parquet_nested_types_encodings/ThriftSingleFieldGroupInList.parquet
> testdata/parquet_nested_types_encodings/AvroSingleFieldGroupInList.parquet
> testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
> testdata/parquet_nested_types_encodings/bad-thrift.parquet
> testdata/ComplexTypesTbl/nonnullable.parq
> testdata/ComplexTypesTbl/nullable.parq
> testdata/bad_parquet_data
> testdata/bad_parquet_data/README
> testdata/bad_parquet_data/dict-encoded-out-of-bounds.parq
> testdata/bad_parquet_data/plain-encoded-negative-len.parq
> testdata/bad_parquet_data/plain-encoded-out-of-bounds.parq
> testdata/bad_parquet_data/dict-encoded-negative-len.parq
> testdata/parquet_schema_resolution
> testdata/parquet_schema_resolution/README
> 

[jira] [Created] (PARQUET-1515) [C++] Disable LZ4 codec

2019-01-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1515:
--

 Summary: [C++] Disable LZ4 codec
 Key: PARQUET-1515
 URL: https://issues.apache.org/jira/browse/PARQUET-1515
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
 Fix For: cpp-1.6.0


As discussed in https://issues.apache.org/jira/browse/PARQUET-1241, 
parquet-cpp's LZ4 codec is not compatible with Hadoop and parquet-mr. We must 
disable the codec until we resolve the compatibility issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1484) [C++] Improve memory usage of FileMetaDataBuilder

2018-12-27 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1484:
---
Fix Version/s: cpp-1.6.0

> [C++] Improve memory usage of FileMetaDataBuilder
> -
>
> Key: PARQUET-1484
> URL: https://issues.apache.org/jira/browse/PARQUET-1484
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> Changes in the PR for ARROW-3324 
> ([https://github.com/apache/arrow/pull/3261]) allow further improving the 
> memory usage by avoiding a copy of the row group metadata inside the Finish() 
> implementation of the FileMetaDataBuilder class.
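
The gist of the improvement, as a self-contained sketch (the types below are
placeholders, not the real metadata classes): Finish() hands over the
accumulated row group metadata by move rather than by copy.

{code}
#include <cstdint>
#include <utility>
#include <vector>

// Placeholder standing in for the real row group metadata structure.
struct RowGroupMetaData { std::vector<int64_t> column_chunk_offsets; };

struct FileMetaDataBuilderSketch {
  std::vector<RowGroupMetaData> row_groups_;
  // Transfers ownership of the accumulated metadata; no deep copy is made.
  std::vector<RowGroupMetaData> Finish() { return std::move(row_groups_); }
};
{code}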



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1484) [C++] Improve memory usage of FileMetaDataBuilder

2018-12-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1484:
--

 Summary: [C++] Improve memory usage of FileMetaDataBuilder
 Key: PARQUET-1484
 URL: https://issues.apache.org/jira/browse/PARQUET-1484
 Project: Parquet
  Issue Type: Improvement
Reporter: Deepak Majeti
Assignee: Deepak Majeti


Changes in the PR for ARROW-3324 ([https://github.com/apache/arrow/pull/3261]) 
allow further improving the memory usage by avoiding a copy of the row group 
metadata inside the Finish() implementation of the FileMetaDataBuilder class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1439) [C++] Parquet build fails when PARQUET_ARROW_LINKAGE is static

2018-10-09 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1439:
--

 Summary: [C++] Parquet build fails when PARQUET_ARROW_LINKAGE is 
static
 Key: PARQUET-1439
 URL: https://issues.apache.org/jira/browse/PARQUET-1439
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.6.0


The error is as follows
{noformat}
CMake Error at cmake_modules/BuildUtils.cmake:145 (add_dependencies):
  The dependency target "/usr/lib/x86_64-linux-gnu/libpthread.so" of target
  "parquet_objlib" does not exist.
Call Stack (most recent call first):
  src/parquet/CMakeLists.txt:183 (ADD_ARROW_LIB)
{noformat}
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1431) [C++] Automatically set thrift to use boost for thrift versions before 0.11

2018-09-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1431:
--

 Summary: [C++] Automatically set thrift to use boost for thrift 
versions before 0.11
 Key: PARQUET-1431
 URL: https://issues.apache.org/jira/browse/PARQUET-1431
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.6.0


PARQUET_THRIFT_USE_BOOST is currently a cmake option. Instead, parquet should 
automatically set the PARQUET_THRIFT_USE_BOOST definition based on the thrift 
version.
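
One way to picture the intended behaviour is the following preprocessor sketch;
the real decision would live in the cmake scripts, and the THRIFT_VERSION_*
macros here are assumptions for illustration.

{code}
// If the detected thrift version is older than 0.11, fall back to the boost
// classes automatically instead of requiring a user-facing cmake option.
#if defined(THRIFT_VERSION_MAJOR) && defined(THRIFT_VERSION_MINOR) && \
    (THRIFT_VERSION_MAJOR == 0) && (THRIFT_VERSION_MINOR < 11)
#define PARQUET_THRIFT_USE_BOOST
#endif
{code}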



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1430) [C++] Add tests for C++ tools

2018-09-27 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1430:
---
Fix Version/s: (was: 1.6.1)
   cpp-1.6.0

> [C++] Add tests for C++ tools
> -
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> We currently do not have any tests for the tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1430) [C++] Add tests for C++ tools

2018-09-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1430:
--

 Summary: [C++] Add tests for C++ tools
 Key: PARQUET-1430
 URL: https://issues.apache.org/jira/browse/PARQUET-1430
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: 1.6.1


We currently do not have any tests for the tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1426) [C++] parquet-dump-schema has poor usability

2018-09-27 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630484#comment-16630484
 ] 

Deepak Majeti commented on PARQUET-1426:


We should add tests for all the tools as well. I will open a Jira for that.

> [C++] parquet-dump-schema has poor usability
> 
>
> Key: PARQUET-1426
> URL: https://issues.apache.org/jira/browse/PARQUET-1426
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> {code}
> $ ./debug/parquet-dump-schema
> terminate called after throwing an instance of 'std::logic_error'
>   what():  basic_string::_S_construct null not valid
> Aborted (core dumped)
> $ ./debug/parquet-dump-schema --help
> Parquet error: Arrow error: IOError: ../src/arrow/io/file.cc:508 code: 
> result->memory_map_->Open(path, mode)
> ../src/arrow/io/file.cc:380 code: file_->OpenReadable(path)
> ../src/arrow/io/file.cc:99 code: internal::FileOpenReadable(file_name_, _)
> Failed to open local file: --help , error: No such file or directory
> {code}
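
A minimal argument check of the kind the tool is missing, as a sketch (the real
main() differs):

{code}
#include <cstdio>

int main(int argc, char** argv) {
  // Fail fast with a usage message instead of treating "--help" (or a null
  // argument) as a file path.
  if (argc != 2) {
    std::fprintf(stderr, "Usage: parquet-dump-schema <parquet-file>\n");
    return 1;
  }
  // ... open argv[1] and print the schema ...
  return 0;
}
{code}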



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1372) [C++] Add an API to allow writing RowGroups based on their size rather than num_rows

2018-08-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1372:
---
Fix Version/s: (was: 1.5.0)
   cpp-1.5.0

> [C++] Add an API to allow writing RowGroups based on their size rather than 
> num_rows
> 
>
> Key: PARQUET-1372
> URL: https://issues.apache.org/jira/browse/PARQUET-1372
> Project: Parquet
>  Issue Type: Task
>Reporter: Anatoli Shein
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> The current API allows writing RowGroups with a specified number of rows; 
> however, it does not allow writing RowGroups of a specified size. In order to 
> write RowGroups of a specified size, we need to write rows in chunks while 
> checking total_bytes_written after each chunk is written. This is 
> currently impossible because the call to NextColumn() closes the current 
> column writer.
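
A self-contained sketch of that chunked pattern, with stub types standing in
for the real writer classes (all names here are assumptions): rows go out in
chunks, and a new row group starts once total_bytes_written() crosses the
target.

{code}
#include <algorithm>
#include <cstdint>
#include <vector>

// Stub standing in for the real RowGroupWriter.
struct RowGroupWriterStub {
  int64_t bytes = 0;
  void WriteBatch(const int64_t* values, int64_t n) {
    (void)values;  // a real writer would encode these values
    bytes += n * static_cast<int64_t>(sizeof(int64_t));
  }
  int64_t total_bytes_written() const { return bytes; }
};

// Stub standing in for the real ParquetFileWriter.
struct FileWriterStub {
  std::vector<RowGroupWriterStub> groups;
  RowGroupWriterStub* AppendRowGroup() { groups.emplace_back(); return &groups.back(); }
};

void WriteBySize(FileWriterStub* file, const std::vector<int64_t>& values,
                 int64_t target_bytes, int64_t chunk_rows) {
  RowGroupWriterStub* rg = file->AppendRowGroup();
  const int64_t total = static_cast<int64_t>(values.size());
  for (int64_t i = 0; i < total; i += chunk_rows) {
    const int64_t n = std::min<int64_t>(chunk_rows, total - i);
    rg->WriteBatch(values.data() + i, n);
    // Check the size after every chunk; roll over to a new row group.
    if (rg->total_bytes_written() >= target_bytes) rg = file->AppendRowGroup();
  }
}
{code}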



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1378) [c++] Allow RowGroups with zero rows to be written

2018-08-13 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1378:
--

 Summary: [c++] Allow RowGroups with zero rows to be written
 Key: PARQUET-1378
 URL: https://issues.apache.org/jira/browse/PARQUET-1378
 Project: Parquet
  Issue Type: Improvement
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: 1.5.0


Currently, the reader-writer.cc example fails when zero rows are written.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1372) [C++] Add an API to allow writing RowGroups based on their size rather than num_rows

2018-08-13 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1372:
--

Assignee: Deepak Majeti  (was: Anatoli Shein)

> [C++] Add an API to allow writing RowGroups based on their size rather than 
> num_rows
> 
>
> Key: PARQUET-1372
> URL: https://issues.apache.org/jira/browse/PARQUET-1372
> Project: Parquet
>  Issue Type: Task
>Reporter: Anatoli Shein
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: 1.5.0
>
>
> The current API allows writing RowGroups with a specified number of rows; 
> however, it does not allow writing RowGroups of a specified size. In order to 
> write RowGroups of a specified size, we need to write rows in chunks while 
> checking total_bytes_written after each chunk is written. This is 
> currently impossible because the call to NextColumn() closes the current 
> column writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1301) [C++] Crypto package in parquet-cpp

2018-08-04 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1301:
---
Fix Version/s: 1.5.0

> [C++] Crypto package in parquet-cpp
> ---
>
> Key: PARQUET-1301
> URL: https://issues.apache.org/jira/browse/PARQUET-1301
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.0
>
>
> The C++ implementation of basic AES-GCM encryption and decryption



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1333) [C++] Reading of files with dictionary size 0 fails on Windows with bad_alloc

2018-06-28 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1333.

Resolution: Fixed

Issue resolved by pull request 472
[https://github.com/apache/parquet-cpp/pull/472]

> [C++] Reading of files with dictionary size 0 fails on Windows with bad_alloc
> -
>
> Key: PARQUET-1333
> URL: https://issues.apache.org/jira/browse/PARQUET-1333
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
> Environment: Microsoft Windows 10 Pro with latest arrow master.
>Reporter: Philipp Hoch
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Account for total_size being 0, i.e. having no dictionary entries to allocate 
> for. The call with size 0 ends up in arrow's memory_pool, 
> [https://github.com/apache/arrow/blob/884474ca5ca1b8da55c0b23eb7cb784c2cd9bdb4/cpp/src/arrow/memory_pool.cc#L50],
>  and the corresponding allocation fails. See the corresponding documentation, 
> [https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc].
>  This only happens on Windows, as posix_memalign seems to handle 0 
> inputs in unix environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1334) [C++] memory_map parameter seems missleading in parquet file opener

2018-06-28 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1334.

   Resolution: Fixed
Fix Version/s: cpp-1.5.0

Issue resolved by pull request 471
[https://github.com/apache/parquet-cpp/pull/471]

> [C++] memory_map parameter seems missleading in parquet file opener
> ---
>
> Key: PARQUET-1334
> URL: https://issues.apache.org/jira/browse/PARQUET-1334
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Philipp Hoch
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> If the memory_map parameter is true, a normal file operation is executed, 
> while in the negative case the corresponding memory-mapped file operation 
> happens. The logic appears to be inverted, or this is a bug.
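
A sketch of the non-inverted dispatch the parameter name implies (placeholder
types, not the actual arrow/parquet classes):

{code}
#include <memory>
#include <string>

struct RandomAccessSource { virtual ~RandomAccessSource() = default; };
struct MemoryMappedSource : RandomAccessSource {};  // mmap-backed reads
struct BufferedFileSource : RandomAccessSource {};  // plain read() calls

std::unique_ptr<RandomAccessSource> OpenSource(const std::string& path,
                                               bool memory_map) {
  (void)path;  // a real implementation would open the file here
  if (memory_map) return std::make_unique<MemoryMappedSource>();
  return std::make_unique<BufferedFileSource>();
}
{code}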



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1340) [C++] Fix Travis Ci valgrind errors related to std::random_device

2018-06-28 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1340.

   Resolution: Fixed
Fix Version/s: cpp-1.5.0

Issue resolved by pull request 473
[https://github.com/apache/parquet-cpp/pull/473]

> [C++] Fix Travis Ci valgrind errors related to std::random_device
> -
>
> Key: PARQUET-1340
> URL: https://issues.apache.org/jira/browse/PARQUET-1340
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> [https://travis-ci.org/apache/parquet-cpp/jobs/395095122]
> ==12164== Conditional jump or move depends on uninitialised value(s)
> ==12164== at 0x510FFD8: std::random_device::_M_init(std::string const&) 
> (cow-string-inst.cc:56)
> ==12164== by 0x73EE6E: std::random_device::random_device(std::string const&) 
> (random.h:1590)
> ==12164== by 0x727C0C: parquet::test::flip_coins(int, double) 
> (test-common.h:104)
> ==12164== by 0x729421: void parquet::test::InitValues(int, 
> std::vector >&, std::vector std::allocator >&) (test-specialization.h:40)
> ==12164== by 0x72E09D: int 
> parquet::test::MakePages 
> >(parquet::ColumnDescriptor const*, int, int, std::vector std::allocator >&, std::vector >&, 
> std::vector::c_type, 
> std::allocator::c_type> >&, 
> std::vector >&, 
> std::vector, 
> std::allocator > >&, parquet::Encoding::type) 
> (test-util.h:429)
> ==12164== by 0x75761D: 
> parquet::test::TestFlatScanner 
> >::Execute(int, int, int, parquet::ColumnDescriptor const*, 
> parquet::Encoding::type) (column_scanner-test.cc:96)
> ==12164== by 0x74F955: 
> parquet::test::TestFlatScanner 
> >::ExecuteAll(int, int, int, int, parquet::Encoding::type) 
> (column_scanner-test.cc:125)
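
One plausible remedy, shown as a sketch (an assumption about the fix, not the
merged patch): seed a PRNG explicitly in the test helper so valgrind never has
to look inside std::random_device's platform-specific initialization.

{code}
#include <cstdint>
#include <random>
#include <vector>

// Deterministic coin flips for tests; no std::random_device involved.
std::vector<bool> FlipCoins(int n, double probability, uint32_t seed = 42) {
  std::mt19937 gen(seed);
  std::bernoulli_distribution coin(probability);
  std::vector<bool> flips(n);
  for (int i = 0; i < n; ++i) flips[i] = coin(gen);
  return flips;
}
{code}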



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1340) [C++] Fix Travis Ci valgrind errors related to std::random_device

2018-06-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1340:
--

 Summary: [C++] Fix Travis Ci valgrind errors related to 
std::random_device
 Key: PARQUET-1340
 URL: https://issues.apache.org/jira/browse/PARQUET-1340
 Project: Parquet
  Issue Type: Improvement
Reporter: Deepak Majeti
Assignee: Deepak Majeti


[https://travis-ci.org/apache/parquet-cpp/jobs/395095122]
==12164== Conditional jump or move depends on uninitialised value(s)
==12164== at 0x510FFD8: std::random_device::_M_init(std::string const&) 
(cow-string-inst.cc:56)
==12164== by 0x73EE6E: std::random_device::random_device(std::string const&) 
(random.h:1590)
==12164== by 0x727C0C: parquet::test::flip_coins(int, double) 
(test-common.h:104)
==12164== by 0x729421: void parquet::test::InitValues(int, 
std::vector >&, std::vector >&) (test-specialization.h:40)
==12164== by 0x72E09D: int 
parquet::test::MakePages 
>(parquet::ColumnDescriptor const*, int, int, std::vector >&, std::vector >&, 
std::vector::c_type, 
std::allocator::c_type> >&, 
std::vector >&, 
std::vector, 
std::allocator > >&, parquet::Encoding::type) 
(test-util.h:429)
==12164== by 0x75761D: 
parquet::test::TestFlatScanner 
>::Execute(int, int, int, parquet::ColumnDescriptor const*, 
parquet::Encoding::type) (column_scanner-test.cc:96)
==12164== by 0x74F955: 
parquet::test::TestFlatScanner 
>::ExecuteAll(int, int, int, int, parquet::Encoding::type) 
(column_scanner-test.cc:125)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1300) [C++] Parquet modular encryption

2018-06-26 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523726#comment-16523726
 ] 

Deepak Majeti commented on PARQUET-1300:


[~thamha], I have not started writing any code in either of the layers 
[~gershinsky] mentioned. I will be happy to work with you on the parquet-cpp 
API as well as the Java interop testing. Can you open a new pull request on the 
parquet-cpp project with your current code? We can get some early feedback and 
it will help us better design the API.

> [C++] Parquet modular encryption
> 
>
> Key: PARQUET-1300
> URL: https://issues.apache.org/jira/browse/PARQUET-1300
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Deepak Majeti
>Priority: Major
> Attachments: column_reader.cc, column_writer.cc, file_reader.cc, 
> file_writer.cc
>
>
> CPP version of a mechanism for modular encryption and decryption of Parquet 
> files. Allows keeping the data fully encrypted in storage, while enabling 
> a client to extract a required subset (footer, column(s), pages) and to 
> authenticate / decrypt the extracted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1315) [C++] ColumnChunkMetaData.has_dictionary_page() should return bool, not int64_t

2018-05-29 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1315:
--

Assignee: Deepak Majeti

> [C++] ColumnChunkMetaData.has_dictionary_page() should return bool, not 
> int64_t
> ---
>
> Key: PARQUET-1315
> URL: https://issues.apache.org/jira/browse/PARQUET-1315
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Deepak Majeti
>Priority: Major
>
> It's semantically a boolean.
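
The change is small; as a sketch (the member name is an assumption):

{code}
#include <cstdint>

struct ColumnChunkMetaDataSketch {
  int64_t dictionary_page_offset = 0;
  // A predicate should return bool, not the raw int64_t offset.
  bool has_dictionary_page() const { return dictionary_page_offset > 0; }
};
{code}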



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-979) [C++] Limit size of min, max or disable stats for long binary types

2018-05-20 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-979.
---
   Resolution: Fixed
Fix Version/s: (was: cpp-1.4.0)
   cpp-1.5.0

Issue resolved by pull request 465
[https://github.com/apache/parquet-cpp/pull/465]

> [C++] Limit size of min, max or disable stats for long binary types
> ---
>
> Key: PARQUET-979
> URL: https://issues.apache.org/jira/browse/PARQUET-979
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Other Parquet implementations like parquet-mr disable min/max values for long 
> binary types > 4KB. For known logical type comparisons, we could approximate 
> min/max values. We need to implement this.
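
A sketch of the parquet-mr-style cutoff; the 4KB threshold comes from the
description above, and the function name is illustrative:

{code}
#include <cstddef>
#include <string>

constexpr std::size_t kMaxStatBytes = 4096;  // 4KB cutoff, as in parquet-mr

// Returns false when min/max are too long to be worth storing; the writer
// would then omit the statistics for this column chunk.
bool ShouldWriteBinaryStats(const std::string& min, const std::string& max) {
  return min.size() <= kMaxStatBytes && max.size() <= kMaxStatBytes;
}
{code}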



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1300) [C++] Parquet modular encryption

2018-05-16 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1300:
--

Assignee: Deepak Majeti

> [C++] Parquet modular encryption
> 
>
> Key: PARQUET-1300
> URL: https://issues.apache.org/jira/browse/PARQUET-1300
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Deepak Majeti
>Priority: Major
> Attachments: column_reader.cc, column_writer.cc, file_reader.cc, 
> file_writer.cc
>
>
> CPP version of a mechanism for modular encryption and decryption of Parquet 
> files. Allows keeping the data fully encrypted in storage, while enabling 
> a client to extract a required subset (footer, column(s), pages) and to 
> authenticate / decrypt the extracted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (PARQUET-1252) [C++] Pass BOOST_ROOT and Boost_NAMESPACE on to Thrift EP

2018-04-23 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reopened PARQUET-1252:


Sorry! It's not a duplicate of 
[PARQUET-1262|https://github.com/apache/parquet-cpp/commit/26422f58c47cfd44b003248fe8bac05f1a65bb4d]

> [C++] Pass BOOST_ROOT and Boost_NAMESPACE on to Thrift EP
> -
>
> Key: PARQUET-1252
> URL: https://issues.apache.org/jira/browse/PARQUET-1252
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Currently we build the {{thrift_ep}} with the Boost version it finds by 
> itself. In the case where {{parquet-cpp}} is built with a very specific Boost 
> version, we also need to build it using this version. This requires passing 
> {{BOOST_ROOT}} and {{Boost_NAMESPACE}} along to the Thrift ExternalProject.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1252) [C++] Pass BOOST_ROOT and Boost_NAMESPACE on to Thrift EP

2018-04-23 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1252.

Resolution: Duplicate

> [C++] Pass BOOST_ROOT and Boost_NAMESPACE on to Thrift EP
> -
>
> Key: PARQUET-1252
> URL: https://issues.apache.org/jira/browse/PARQUET-1252
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Currently we build the {{thrift_ep}} with the Boost version it finds by 
> itself. In the case where {{parquet-cpp}} is built with a very specific Boost 
> version, we also need to build it using this version. This requires passing 
> {{BOOST_ROOT}} and {{Boost_NAMESPACE}} along to the Thrift ExternalProject.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1021) [C++] Print more helpful failure message when PARQUET_TEST_DATA environment variable is not set

2018-04-23 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1021.

Resolution: Fixed

Fixed with PARQUET-1255

> [C++] Print more helpful failure message when PARQUET_TEST_DATA environment 
> variable is not set
> ---
>
> Key: PARQUET-1021
> URL: https://issues.apache.org/jira/browse/PARQUET-1021
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Affects Versions: cpp-1.1.0
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1269) [C++] Scanning fails with list columns

2018-04-23 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1269.

Resolution: Fixed

Resolved in PARQUET-1272

> [C++] Scanning fails with list columns
> --
>
> Key: PARQUET-1269
> URL: https://issues.apache.org/jira/browse/PARQUET-1269
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code:python}
> >>> list_arr = pa.array([[1, 2], [3, 4, 5]])
> >>> int_arr = pa.array([10, 11])
> >>> table = pa.Table.from_arrays([int_arr, list_arr], ['ints', 'lists'])
> >>> bio = io.BytesIO()
> >>> pq.write_table(table, bio)
> >>> bio.seek(0)
> 0
> >>> reader = pq.ParquetReader()
> >>> reader.open(bio)
> >>> reader.scan_contents()
> Traceback (most recent call last):
>   File "", line 1, in 
> reader.scan_contents()
>   File "_parquet.pyx", line 753, in 
> pyarrow._parquet.ParquetReader.scan_contents
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> ArrowIOError: Parquet error: Total rows among columns do not match
> {code}
> ScanFileContents() claims it returns the "number of semantic rows" but 
> apparently it actually counts the number of physical elements?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1262) [C++] Use the same BOOST_ROOT and Boost_NAMESPACE for Thrift

2018-04-23 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1262.

Resolution: Fixed

Issue resolved by pull request 460
[https://github.com/apache/parquet-cpp/pull/460]

> [C++] Use the same BOOST_ROOT and Boost_NAMESPACE for Thrift 
> -
>
> Key: PARQUET-1262
> URL: https://issues.apache.org/jira/browse/PARQUET-1262
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> When building Thrift using the ExternalProject facility, we do not pass on 
> the variables for a custom Boost variant. Thus if the user uses a differently 
> flavoured/located Boost, Thrift does not pick it up. As a consequence, we 
> explicitly build Thrift during the Arrow OS X Wheel build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-12 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1265.

Resolution: Fixed

> Segfault on static ApplicationVersion initialization
> 
>
> Key: PARQUET-1265
> URL: https://issues.apache.org/jira/browse/PARQUET-1265
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Lawrence Chan
>Assignee: Deepak Majeti
>Priority: Major
>
> I'm seeing a segfault when I link/run with a shared libparquet.so with 
> statically linked boost. Given the backtrace, it seems that this is due to 
> the static ApplicationVersion constants, likely due to some static 
> initialization order issue. The problem goes away if I turn those static vars 
> into static funcs returning function-local statics.
> Backtrace:
> {code}
> #0  0x7753cf8b in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /lib64/libstdc++.so.6
> #1  0x77aeae9c in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() () from 
> debug/libparquet.so.1
> #2  0x77adcc2b in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from debug/libparquet.so.1
> #3  0x77ae9023 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from debug/libparquet.so.1
> #4  0x77a5ed98 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p1=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x77af6720 "", f=0) at 
> /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x77a5b653 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x77a57049 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x77ddbfc0 
> , 
> created_by="parquet-mr version 1.8.0") at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
> #8  0x77a516c5 in __static_initialization_and_destruction_0 
> (__initialize_p=1, __priority=65535) at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
> #9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) () at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
> #10 0x77dec1e3 in _dl_init_internal () from 
> /lib64/ld-linux-x86-64.so.2
> #11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
> #12 0x0001 in ?? ()
> #13 0x7fff5ff5 in ?? ()
> #14 0x in ?? ()
> {code}
> Versions:
> - gcc-4.8.5
> - boost-1.66.0
> - parquet-cpp-1.4.0
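
The reporter's workaround, as a self-contained sketch with a placeholder type:
a namespace-scope static constructed at load time races with boost's own
statics across shared objects, while a function-local static is initialized on
first use.

{code}
#include <string>

struct VersionSketch {
  explicit VersionSketch(const std::string& created_by) : created_by_(created_by) {}
  std::string created_by_;
};

// Before: constructed during dynamic library init, order-dependent.
//   static const VersionSketch kFixedVersion("parquet-mr version 1.8.0");

// After: constructed lazily on first call, after all load-time init is done.
const VersionSketch& FixedVersion() {
  static const VersionSketch v("parquet-mr version 1.8.0");
  return v;
}
{code}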



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-09 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1265:
--

Assignee: Deepak Majeti

> Segfault on static ApplicationVersion initialization
> 
>
> Key: PARQUET-1265
> URL: https://issues.apache.org/jira/browse/PARQUET-1265
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Lawrence Chan
>Assignee: Deepak Majeti
>Priority: Major
>
> I'm seeing a segfault when I link/run with a shared libparquet.so with 
> statically linked boost. Given the backtrace, it seems that this is due to 
> the static ApplicationVersion constants, likely due to some static 
> initialization order issue. The problem goes away if I turn those static vars 
> into static funcs returning function-local statics.
> Backtrace:
> {code}
> #0  0x7753cf8b in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /lib64/libstdc++.so.6
> #1  0x77aeae9c in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() () from 
> debug/libparquet.so.1
> #2  0x77adcc2b in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from debug/libparquet.so.1
> #3  0x77ae9023 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from debug/libparquet.so.1
> #4  0x77a5ed98 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p1=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x77af6720 "", f=0) at 
> /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x77a5b653 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x77a57049 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x77ddbfc0 
> , 
> created_by="parquet-mr version 1.8.0") at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
> #8  0x77a516c5 in __static_initialization_and_destruction_0 
> (__initialize_p=1, __priority=65535) at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
> #9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) () at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
> #10 0x77dec1e3 in _dl_init_internal () from 
> /lib64/ld-linux-x86-64.so.2
> #11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
> #12 0x0001 in ?? ()
> #13 0x7fff5ff5 in ?? ()
> #14 0x in ?? ()
> {code}
> Versions:
> - gcc-4.8.5
> - boost-1.66.0
> - parquet-cpp-1.4.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-06 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429068#comment-16429068
 ] 

Deepak Majeti commented on PARQUET-1265:


I will be happy to work on the PR for this if you are okay with it.

> Segfault on static ApplicationVersion initialization
> 
>
> Key: PARQUET-1265
> URL: https://issues.apache.org/jira/browse/PARQUET-1265
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Lawrence Chan
>Priority: Major
>
> I'm seeing a segfault when I link/run with a shared libparquet.so with 
> statically linked boost. Given the backtrace, it seems that this is due to 
> the static ApplicationVersion constants, likely due to some static 
> initialization order issue. The problem goes away if I turn those static vars 
> into static funcs returning function-local statics.
> Backtrace:
> {code}
> #0  0x7753cf8b in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /lib64/libstdc++.so.6
> #1  0x77aeae9c in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() () from 
> debug/libparquet.so.1
> #2  0x77adcc2b in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from debug/libparquet.so.1
> #3  0x77ae9023 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from debug/libparquet.so.1
> #4  0x77a5ed98 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p1=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x77af6720 "", f=0) at 
> /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x77a5b653 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x77a57049 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x77ddbfc0 
> , 
> created_by="parquet-mr version 1.8.0") at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
> #8  0x77a516c5 in __static_initialization_and_destruction_0 
> (__initialize_p=1, __priority=65535) at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
> #9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) () at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
> #10 0x77dec1e3 in _dl_init_internal () from 
> /lib64/ld-linux-x86-64.so.2
> #11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
> #12 0x0001 in ?? ()
> #13 0x7fff5ff5 in ?? ()
> #14 0x in ?? ()
> {code}
> Versions:
> - gcc-4.8.5
> - boost-1.66.0
> - parquet-cpp-1.4.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-06 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428531#comment-16428531
 ] 

Deepak Majeti commented on PARQUET-1265:


Getting this in will be nice. The fix basically delays the initialization of 
static variables, which seems to resolve the linking issues.

> Segfault on static ApplicationVersion initialization
> 
>
> Key: PARQUET-1265
> URL: https://issues.apache.org/jira/browse/PARQUET-1265
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Lawrence Chan
>Priority: Major
>
> I'm seeing a segfault when I link/run with a shared libparquet.so with 
> statically linked boost. Given the backtrace, it seems that this is due to 
> the static ApplicationVersion constants, likely due to some static 
> initialization order issue. The problem goes away if I turn those static vars 
> into static funcs returning function-local statics.
> Backtrace:
> {code}
> #0  0x7753cf8b in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /lib64/libstdc++.so.6
> #1  0x77aeae9c in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() () from 
> debug/libparquet.so.1
> #2  0x77adcc2b in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from debug/libparquet.so.1
> #3  0x77ae9023 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from debug/libparquet.so.1
> #4  0x77a5ed98 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p1=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x77af6720 "", f=0) at 
> /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x77a5b653 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x77a57049 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x77ddbfc0 
> , 
> created_by="parquet-mr version 1.8.0") at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
> #8  0x77a516c5 in __static_initialization_and_destruction_0 
> (__initialize_p=1, __priority=65535) at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
> #9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) () at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
> #10 0x77dec1e3 in _dl_init_internal () from 
> /lib64/ld-linux-x86-64.so.2
> #11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
> #12 0x0001 in ?? ()
> #13 0x7fff5ff5 in ?? ()
> #14 0x in ?? ()
> {code}
> Versions:
> - gcc-4.8.5
> - boost-1.66.0
> - parquet-cpp-1.4.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370767#comment-16370767
 ] 

Deepak Majeti commented on PARQUET-1225:


[~boroknagyz] I opened a PR here 
[https://github.com/apache/parquet-cpp/pull/444] and made some comments. Let me 
know your feedback.

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while gathering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.
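
A sketch of a NaN-safe update loop (illustrative, not the shipped fix): NaNs
are skipped outright, so an initial NaN can no longer pin the minimum.

{code}
#include <cmath>
#include <cstdint>

void UpdateMinMax(const double* values, int64_t n,
                  double* min, double* max, bool* has_value) {
  for (int64_t i = 0; i < n; ++i) {
    if (std::isnan(values[i])) continue;  // NaN never participates in ordering
    if (!*has_value) {
      *min = *max = values[i];
      *has_value = true;
    } else {
      if (values[i] < *min) *min = values[i];
      if (values[i] > *max) *max = values[i];
    }
  }
}
{code}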



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370602#comment-16370602
 ] 

Deepak Majeti commented on PARQUET-1225:


Is Impala not handling the write path?

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while gathering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1233) [CPP] Enable option to switch between stl classes and boost classes for thrift header

2018-02-20 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1233:
--

 Summary: [CPP] Enable option to switch between stl classes and 
boost classes for thrift header
 Key: PARQUET-1233
 URL: https://issues.apache.org/jira/browse/PARQUET-1233
 Project: Parquet
  Issue Type: Bug
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.4.0


Thrift 0.11.0 introduced breaking changes by defaulting to stl classes. This 
causes an issue with older thrift versions. The scope of this Jira is to enable 
an option to choose between stl and boost in parquet thrift header.

https://thrift.apache.org/lib/cpp
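
The shape of such a switch in the parquet thrift header, sketched below; the
macro name matches the issue, while the namespace and alias are illustrative.

{code}
#ifdef PARQUET_THRIFT_USE_BOOST
// thrift <= 0.10 generated code expects boost::shared_ptr.
#include <boost/shared_ptr.hpp>
namespace parquet_thrift { using boost::shared_ptr; }
#else
// thrift >= 0.11 defaults to the stl classes.
#include <memory>
namespace parquet_thrift { using std::shared_ptr; }
#endif
{code}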



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-19 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1225:
--

Assignee: Deepak Majeti

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a NaN. 
> This means that while gathering statistics, if a NaN is the smallest value 
> encountered so far (which happens to be the case after reading the first 
> value if that value is NaN), no other value can ever replace it, since < will 
> always be false. On the other hand, if NaN is not the first value, it won't 
> affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1193) [CPP] Implement ColumnOrder to support min_value and max_value

2018-01-11 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1193:
--

 Summary: [CPP] Implement ColumnOrder to support min_value and 
max_value
 Key: PARQUET-1193
 URL: https://issues.apache.org/jira/browse/PARQUET-1193
 Project: Parquet
  Issue Type: Bug
Reporter: Deepak Majeti
Assignee: Deepak Majeti


Use ColumnOrder to set min_value and max_value statistics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206424#comment-16206424
 ] 

Deepak Majeti commented on PARQUET-1065:


If we treat Int96 as a primitive data type, then we must compare 
Int96 (little-endian) in reverse byte order. Then we will check the most 
significant bits first, correct?

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-12 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202552#comment-16202552
 ] 

Deepak Majeti commented on PARQUET-1065:


INT96 timestamps can be sorted using both signed and unsigned sort orders.
The date values are always positive since they are Julian day numbers. 
Therefore, both orders should work.
Discussion on how the values must be compared is here: 
https://github.com/apache/parquet-format/pull/55


> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-530) Add support for LZO compression

2017-10-07 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195703#comment-16195703
 ] 

Deepak Majeti commented on PARQUET-530:
---

[~rdblue] Is the plan to deprecate/remove LZO compression from parquet format?

> Add support for LZO compression
> ---
>
> Key: PARQUET-530
> URL: https://issues.apache.org/jira/browse/PARQUET-530
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Affects Versions: cpp-1.0.0
>Reporter: Aliaksei Sandryhaila
>Assignee: Wes McKinney
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PARQUET-1105) [CPP] Remove libboost_system dependency

2017-09-26 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1105:
--

Assignee: Deepak Majeti

> [CPP] Remove libboost_system dependency 
> 
>
> Key: PARQUET-1105
> URL: https://issues.apache.org/jira/browse/PARQUET-1105
> Project: Parquet
>  Issue Type: Bug
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
> Fix For: cpp-1.3.1
>
>
> Arrow added an additional libboost_system dependency. This dependency is now 
> transitively added to parquet for static linking. Remove it from Parquet 
> when Arrow resolves ARROW-1536.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1108) [C++] Fix Int96 comparators

2017-09-20 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1108:
--

 Summary: [C++] Fix Int96 comparators
 Key: PARQUET-1108
 URL: https://issues.apache.org/jira/browse/PARQUET-1108
 Project: Parquet
  Issue Type: Bug
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.3.0


As discussed here https://github.com/apache/parquet-format/pull/55/files
The bytes must be compared in the reverse order.
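
A sketch of the reverse-order comparison (the struct layout is illustrative):
with little-endian storage, the most significant byte sits last, so an unsigned
comparison walks the bytes from the end.

{code}
#include <cstdint>

struct Int96Sketch { uint8_t bytes[12]; };  // little-endian byte order

bool LessThanUnsigned(const Int96Sketch& a, const Int96Sketch& b) {
  for (int i = 11; i >= 0; --i) {  // most significant byte first
    if (a.bytes[i] != b.bytes[i]) return a.bytes[i] < b.bytes[i];
  }
  return false;  // equal
}
{code}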



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1105) [CPP] Remove libboost_system dependency

2017-09-13 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1105:
--

 Summary: [CPP] Remove libboost_system dependency 
 Key: PARQUET-1105
 URL: https://issues.apache.org/jira/browse/PARQUET-1105
 Project: Parquet
  Issue Type: Bug
Reporter: Deepak Majeti


Arrow added an additional libboost_system dependency, which is now transitively 
pulled into Parquet for static linking. Remove it from Parquet once Arrow 
resolves ARROW-1536.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1088) [CPP] remove parquet_version.h from version control since it gets auto-generated

2017-09-05 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1088:
--

 Summary: [CPP] remove parquet_version.h from version control since 
it gets auto-generated 
 Key: PARQUET-1088
 URL: https://issues.apache.org/jira/browse/PARQUET-1088
 Project: Parquet
  Issue Type: Bug
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.3.0






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PARQUET-1075) C++: Coverage upload is broken

2017-08-08 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1075:
--

Assignee: Deepak Majeti

> C++: Coverage upload is broken
> --
>
> Key: PARQUET-1075
> URL: https://issues.apache.org/jira/browse/PARQUET-1075
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Deepak Majeti
> Fix For: cpp-1.3.0
>
>
> {code}
> ++which gcov-4.9
> +coveralls --gcov /usr/bin/gcov-4.9 --gcov-options '\-l' --root '' --include 
> /home/travis/build/apache/parquet-cpp --exclude 
> /home/travis/build/apache/parquet-cpp/parquet-build/thirdparty --exclude 
> /home/travis/build/apache/parquet-cpp/parquet-build/arrow_ep --exclude 
> /home/travis/build/apache/parquet-cpp/parquet-build/brotli_ep --exclude 
> /home/travis/build/apache/parquet-cpp/parquet-build/brotli_ep-prefix 
> --exclude /home/travis/build/apache/parquet-cpp/parquet-build/gbenchmark_ep 
> --exclude 
> /home/travis/build/apache/parquet-cpp/parquet-build/googletest_ep-prefix 
> --exclude /home/travis/build/apache/parquet-cpp/parquet-build/snappy_ep 
> --exclude 
> /home/travis/build/apache/parquet-cpp/parquet-build/snappy_ep-prefix 
> --exclude /home/travis/build/apache/parquet-cpp/parquet-build/zlib_ep 
> --exclude /home/travis/build/apache/parquet-cpp/parquet-build/zlib_ep-prefix 
> --exclude /home/travis/build/apache/parquet-cpp/build --exclude 
> /home/travis/build/apache/parquet-cpp/src/parquet/thrift --exclude /usr
> Traceback (most recent call last):
>   File "/usr/local/bin/coveralls", line 6, in 
> from pkg_resources import load_entry_point
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 3037, in 
> @_call_aside
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 3021, in _call_aside
> f(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 3050, in _initialize_master_working_set
> working_set = WorkingSet._build_master()
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 655, in _build_master
> ws.require(__requires__)
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 969, in require
> needed = self.resolve(parse_requirements(requirements))
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 863, in resolve
> new_requirements = dist.requires(req.extras)[::-1]
>   File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", 
> line 2577, in requires
> "%s has no such extra feature %r" % (self, ext)
> pkg_resources.UnknownExtra: urllib3 1.7.1 has no such extra feature 'secure'
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1048) [C++] Static linking of libarrow is no longer supported

2017-07-10 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080412#comment-16080412
 ] 

Deepak Majeti commented on PARQUET-1048:


Updated PR: https://github.com/apache/parquet-cpp/pull/367

> [C++] Static linking of libarrow is no longer supported
> ---
>
> Key: PARQUET-1048
> URL: https://issues.apache.org/jira/browse/PARQUET-1048
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: cpp-1.2.0
>
>
> Since the compression libraries were moved to Apache Arrow, static linking 
> requires pulling in the transitive dependencies. It is unclear whether we 
> want to keep supporting this and, if so, what the best way to do it is 
> (possibly via an external project).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1052) [C++] add_compiler_export_flags() throws warning with CMake >= 3.3

2017-07-08 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1052:
--

 Summary: [C++] add_compiler_export_flags() throws warning with 
CMake >= 3.3
 Key: PARQUET-1052
 URL: https://issues.apache.org/jira/browse/PARQUET-1052
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti


The following warning is shown:
CMake Deprecation Warning at 
/usr/share/cmake-3.5/Modules/GenerateExportHeader.cmake:383 (message):
  The add_compiler_export_flags function is obsolete.  Use the
  CXX_VISIBILITY_PRESET and VISIBILITY_INLINES_HIDDEN target properties
  instead.
Call Stack (most recent call first):
  CMakeLists.txt:437 (add_compiler_export_flags)

Similar problem in KUDU: https://issues.apache.org/jira/browse/KUDU-1390




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PARQUET-1048) [C++] Static linking of libarrow is no longer supported

2017-06-30 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070806#comment-16070806
 ] 

Deepak Majeti edited comment on PARQUET-1048 at 6/30/17 10:36 PM:
--

I briefly looked at posts that suggest CMake export targets for handling 
transitive dependencies. I will take another look to see if that suits our 
needs here.

If we leave the interfaces to the compression libraries in Arrow, we have to 
ensure Parquet uses the same compression library versions to avoid ABI 
incompatibility issues.



was (Author: mdeepak):
I briefly looked at posts that suggest CMake export targets to handle 
transitive dependencies. I will take another look if it suits our needs here.

If we leave the interfaces to compression libraries in Arrow, we have to ensure 
Parquet uses the same compression library versions to avoid ABI incompatibility 
issues. 


> [C++] Static linking of libarrow is no longer supported
> ---
>
> Key: PARQUET-1048
> URL: https://issues.apache.org/jira/browse/PARQUET-1048
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.2.0
>
>
> Since the compression libraries were moved to Apache Arrow, static linking 
> requires pulling in the transitive dependencies. It is unclear whether we 
> want to keep supporting this and, if so, what the best way to do it is 
> (possibly via an external project).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1048) [C++] Static linking of libarrow is no longer supported

2017-06-30 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070806#comment-16070806
 ] 

Deepak Majeti commented on PARQUET-1048:


I briefly looked at posts that suggest CMake export targets for handling 
transitive dependencies. I will take another look to see if that suits our 
needs here.

If we leave the interfaces to the compression libraries in Arrow, we have to 
ensure Parquet uses the same compression library versions to avoid ABI 
incompatibility issues.


> [C++] Static linking of libarrow is no longer supported
> ---
>
> Key: PARQUET-1048
> URL: https://issues.apache.org/jira/browse/PARQUET-1048
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.2.0
>
>
> Since the compression libraries were moved to Apache Arrow, static linking 
> requires pulling in the transitive dependencies. It is unclear whether we 
> want to keep supporting this and, if so, what the best way to do it is 
> (possibly via an external project).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1048) [C++] Static linking of libarrow is no longer supported

2017-06-28 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067697#comment-16067697
 ] 

Deepak Majeti commented on PARQUET-1048:


We definitely have to support static linking. Can we retain the previous 
linking with the compression libraries via an external project? 

> [C++] Static linking of libarrow is no longer supported
> ---
>
> Key: PARQUET-1048
> URL: https://issues.apache.org/jira/browse/PARQUET-1048
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.2.0
>
>
> Since the compression libraries were moved to Apache Arrow, static linking 
> requires pulling in the transitive dependencies. It is unclear whether we 
> want to keep supporting this and, if so, what the best way to do it is 
> (possibly via an external project).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PARQUET-1007) [C++ ] Update parquet.thrift from https://github.com/apache/parquet-format

2017-06-17 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1007:
--

Assignee: Deepak Majeti

> [C++ ] Update parquet.thrift from https://github.com/apache/parquet-format
> --
>
> Key: PARQUET-1007
> URL: https://issues.apache.org/jira/browse/PARQUET-1007
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
> Fix For: cpp-1.2.0
>
>
> Support recent format changes including
> 1) PARQUET-906: Add LogicalType annotation (yet to commit)
> 2) PARQUET-686: Add Order to store the order used for min/max stat



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-06-13 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-1012.

   Resolution: Fixed
Fix Version/s: 1.9.0

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
> Fix For: 1.9.0
>
>
> Spark 2.1 uses parquet-mr(common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read 
> a Parquet file generated by parquet-cpp.
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-06-03 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16036070#comment-16036070
 ] 

Deepak Majeti commented on PARQUET-1012:


If you just fix the parsing part, you will end up using incorrect statistics: 
parquet-cpp versions 1.0.0 and 1.1.0 do not compute the correct unsigned binary 
statistics the way parquet-mr does (PARQUET-686). However, parquet-mr only 
ignores these statistics if the writer is parquet-mr; PARQUET-1017 extends this 
handling to parquet-cpp.
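The underlying issue (a sketch of the problem, not the actual patch) is that 
binary min/max must be computed with unsigned byte comparison; a signed-char 
comparison mis-orders any byte at or above 0x80:

{code}
#include <algorithm>
#include <cstring>
#include <string>

// Sketch: UTF-8/binary values must be ordered by *unsigned* bytes.
// With signed chars, 0xC3 (-61) sorts before 0x61 ('a'), so a writer
// using signed comparison emits min/max that readers cannot trust.
bool UnsignedByteLess(const std::string& a, const std::string& b) {
  int cmp = std::memcmp(a.data(), b.data(), std::min(a.size(), b.size()));
  if (cmp != 0) return cmp < 0;  // memcmp compares as unsigned bytes
  return a.size() < b.size();    // shorter string is a prefix, hence smaller
}
{code}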

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr(common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read 
> a Parquet file generated by parquet-cpp.
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-1017) parquet-mr must handle statistics in files written by parquet-cpp

2017-06-02 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1017:
---
Summary: parquet-mr must handle statistics in files written by parquet-cpp  
(was: Ignore binary statistics in parquet-mr from files written by parquet-cpp)

> parquet-mr must handle statistics in files written by parquet-cpp
> -
>
> Key: PARQUET-1017
> URL: https://issues.apache.org/jira/browse/PARQUET-1017
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Deepak Majeti
>
> {{shouldIgnoreStatistics}} in parquet-mr always accepts statistics for files 
> written by other applications, including parquet-cpp. We must fix this to 
> correctly handle statistics written by the affected parquet-cpp versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1017) Ignore binary statistics in parquet-mr from files written by parquet-cpp

2017-06-02 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1017:
--

 Summary: Ignore binary statistics in parquet-mr from files written 
by parquet-cpp
 Key: PARQUET-1017
 URL: https://issues.apache.org/jira/browse/PARQUET-1017
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Deepak Majeti


{{shouldIgnoreStatistics}} in parquet-mr always accepts statistics for files 
written by other applications, including parquet-cpp. We must fix this to 
correctly handle statistics written by the affected parquet-cpp versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-06-02 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035493#comment-16035493
 ] 

Deepak Majeti commented on PARQUET-1012:


Looking at {{shouldIgnoreStatistics}}, parquet-mr always accepts statistics for 
files written by other applications, including parquet-cpp. We must fix this at 
least for parquet-cpp.

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr(common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read 
> a Parquet file generated by parquet-cpp.
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-06-02 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035485#comment-16035485
 ] 

Deepak Majeti commented on PARQUET-1012:


I could not reproduce this with parquet-mr 1.8.2.
The error seems to originate from {{shouldIgnoreStatistics}} in 
{{parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java:52}}.
I added a test in 
{{parquet-column/src/test/java/org/apache/parquet/CorruptStatisticsTest.java}} 
with the {{parquet-cpp version 1.0.0}} string and it works fine.
Can you share the parquet-cpp file?

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr(common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read 
> a Parquet file generated by parquet-cpp.
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-1014) Example for multiple row group writer (cpp)

2017-06-01 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033781#comment-16033781
 ] 

Deepak Majeti commented on PARQUET-1014:


Hi [~elderrex],
The example at {{examples/reader-writer.cc}} adds a single row group. To add 
another row group, you just need to call 
{{file_writer->AppendRowGroup(NUM_ROWS_PER_ROW_GROUP)}} again at 
{{examples/reader-writer.cc:123}}.

You will need def/rep levels for all types, including primitive types like 
int64_t; a minimal sketch follows.
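A sketch of writing two row groups, assuming {{out_file}}, {{schema}}, and 
{{NUM_ROWS_PER_ROW_GROUP}} are set up as in {{examples/reader-writer.cc}} (the 
names come from that example; this is not a complete program):

{code}
std::shared_ptr<parquet::ParquetFileWriter> file_writer =
    parquet::ParquetFileWriter::Open(out_file, schema);

for (int rg = 0; rg < 2; rg++) {  // one iteration per row group
  parquet::RowGroupWriter* rg_writer =
      file_writer->AppendRowGroup(NUM_ROWS_PER_ROW_GROUP);
  parquet::Int64Writer* int64_writer =
      static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
  for (int64_t i = 0; i < NUM_ROWS_PER_ROW_GROUP; i++) {
    int64_t value = i;
    // Definition level 1 marks the value as defined (needed if the
    // column is optional; required columns may pass nullptr instead).
    int16_t definition_level = 1;
    int64_writer->WriteBatch(1, &definition_level, nullptr, &value);
  }
  rg_writer->Close();
}
file_writer->Close();
{code}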

Do you have a reproducer for the crash?
Thanks!

> Example for multiple row group writer (cpp)
> ---
>
> Key: PARQUET-1014
> URL: https://issues.apache.org/jira/browse/PARQUET-1014
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Affects Versions: cpp-1.1.0
>Reporter: yugu
> Fix For: cpp-1.1.0
>
>
> Been looking through the repo and cannot find an example of a multiple row 
> group writer.
> Probably missed it. Would be great if you could point it out! :D
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-05-31 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1012:
--

Assignee: Deepak Majeti

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr(common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read 
> a Parquet file generated by parquet-cpp.
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-05-31 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031383#comment-16031383
 ] 

Deepak Majeti commented on PARQUET-1012:


I can take a look at this as part of PARQUET-1003.

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>
> Spark 2.1 uses parquet-mr(common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read 
> a Parquet file generated by parquet-cpp.
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1007) [C++ ] Update parquet.thrift from https://github.com/apache/parquet-format

2017-05-26 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1007:
--

 Summary: [C++ ] Update parquet.thrift from 
https://github.com/apache/parquet-format
 Key: PARQUET-1007
 URL: https://issues.apache.org/jira/browse/PARQUET-1007
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Deepak Majeti
 Fix For: cpp-1.2.0


Support recent format changes including
1) PARQUET-906: Add LogicalType annotation (yet to commit)
2) PARQUET-686: Add Order to store the order used for min/max stat



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-699) Update parquet.thrift from https://github.com/apache/parquet-format

2017-05-26 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-699.
---
   Resolution: Fixed
Fix Version/s: (was: cpp-1.2.0)
   cpp-1.0.0

> Update parquet.thrift from https://github.com/apache/parquet-format
> ---
>
> Key: PARQUET-699
> URL: https://issues.apache.org/jira/browse/PARQUET-699
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Florian Scheibner
>Assignee: Deepak Majeti
> Fix For: cpp-1.0.0
>
>
> Support logical types TIME_MICROS and TIMESTAMP_MICROS.
> Also, the current code was incorrect: Parquet reserved the LogicalTypes 8 and 
> 10, but those were completely omitted in types.h, so types with greater 
> indices were mapped incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1003) [C++] Modify DEFAULT_CREATED_BY value for every new release version

2017-05-25 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1003:
--

 Summary: [C++] Modify DEFAULT_CREATED_BY value for every new 
release version
 Key: PARQUET-1003
 URL: https://issues.apache.org/jira/browse/PARQUET-1003
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.2.0


For every new release version, we should modify the {{DEFAULT_CREATED_BY}} 
value at {{src/parquet/column/properties.h:88}} accordingly.
We should automate this update.
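Until the update is automated, a writer can also set the value explicitly; a 
sketch using the writer properties builder (the version string here is 
illustrative, not the actual release value):

{code}
#include "parquet/column/properties.h"

// Sketch: override created_by per writer so a stale DEFAULT_CREATED_BY
// baked into a release does not leak into new files.
std::shared_ptr<parquet::WriterProperties> props =
    parquet::WriterProperties::Builder()
        .created_by("parquet-cpp version 1.2.0")
        ->build();
{code}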



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1002) [C++] Compute statistics based on Logical Types

2017-05-25 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1002:
--

 Summary: [C++] Compute statistics based on Logical Types
 Key: PARQUET-1002
 URL: https://issues.apache.org/jira/browse/PARQUET-1002
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.2.0


The current implementation computes statistics based on the physical type. We 
need to consider the logical type first, if one is specified.
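For instance (a sketch of the problem, not the planned implementation): an 
INT32 column annotated as UINT_32 must be compared as unsigned, otherwise 
values at or above 2^31 corrupt the min/max:

{code}
#include <cstdint>

// Sketch: physical INT32 annotated with logical type UINT_32.
// Comparing the raw physical values mis-orders anything >= 2^31.
bool LogicalLessThan(int32_t a, int32_t b, bool is_uint32) {
  if (is_uint32) {
    return static_cast<uint32_t>(a) < static_cast<uint32_t>(b);
  }
  return a < b;
}
// Stored bits 0xFFFFFFFF are unsigned 4294967295 but signed -1: a
// physical-type comparison would wrongly record them as the minimum.
{code}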



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

