[jira] [Resolved] (PARQUET-672) [C++] Build testing conda artifacts in debug mode

2016-08-03 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-672.
--
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 141
[https://github.com/apache/parquet-cpp/pull/141]

> [C++] Build testing conda artifacts in debug mode
> -
>
> Key: PARQUET-672
> URL: https://issues.apache.org/jira/browse/PARQUET-672
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-0.1
>
>
> The intent of this is to help us fix the broken build being discussed in 
> ARROW-247
> https://github.com/apache/arrow/pull/111



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-672) [C++] Build testing conda artifacts in debug mode

2016-08-03 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-672:


 Summary: [C++] Build testing conda artifacts in debug mode
 Key: PARQUET-672
 URL: https://issues.apache.org/jira/browse/PARQUET-672
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Wes McKinney


The intent of this is to help us fix the broken build being discussed in 
ARROW-247

https://github.com/apache/arrow/pull/111





[jira] [Resolved] (PARQUET-669) Allow reading file footers from input streams when writing metadata files

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-669.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 357
https://github.com/apache/parquet-mr/pull/357


> Allow reading file footers from input streams when writing metadata files
> -
>
> Key: PARQUET-669
> URL: https://issues.apache.org/jira/browse/PARQUET-669
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
> Fix For: 1.9.0
>
>






[jira] [Updated] (PARQUET-669) Allow reading file footers from input streams when writing metadata files

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-669:
--
Assignee: Robert Kruszewski

> Allow reading file footers from input streams when writing metadata files
> -
>
> Key: PARQUET-669
> URL: https://issues.apache.org/jira/browse/PARQUET-669
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
>






[jira] [Resolved] (PARQUET-668) Provide option to disable auto crop feature in DumpCommand output

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-668.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 358
[https://github.com/apache/parquet-mr/pull/358]

> Provide option to disable auto crop feature in DumpCommand output
> -
>
> Key: PARQUET-668
> URL: https://issues.apache.org/jira/browse/PARQUET-668
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Daniel Harper
>Priority: Trivial
> Fix For: 1.9.0
>
>
> *Problem*
> When using the {{dump}} command in {{parquet-tools}}, the output will 
> sometimes be truncated based on the width of your console, especially on 
> smaller displays.
> Example:
> {code}
> row group 0
> 
> id:  INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100  
> [more]...
> name:BINARY SNAPPY DO:0 FPO:44672 SZ:89464018/1031768430/11.53 
> [more]...
> event_time:  INT64 SNAPPY DO:0 FPO:89508690 SZ:43600235/57923935/1.33 
> VC:7240100 [more]...
> id TV=7240100 RL=0 DL=0 DS: 2 DE:PLAIN_DICTIONARY
> 
> 
> page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLA 
> [more]... SZ:33291
> {code}
> This is especially annoying if you pipe the output to a file, as the 
> truncation remains in place.
> *Proposed fix*
> Provide the flag {{--disable-crop}} for the dump command. Truncation is 
> enabled by default and will only be disabled when this flag is provided.
> This will output the full content to standard out, for example:
> {code}
> row group 0
> 
> id:  INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100 
> ENC:BIT_PACKED,PLAIN_DICTIONARY
> name:BINARY SNAPPY DO:0 FPO:44672 SZ:89464018/1031768430/11.53 
> VC:7240100 ENC:PLAIN,BIT_PACKED
> event_time:  INT64 SNAPPY DO:0 FPO:89508690 SZ:43600235/57923935/1.33 
> VC:7240100 ENC:PLAIN,BIT_PACKED,RLE
> id TV=7240100 RL=0 DL=0 DS: 2 DE:PLAIN_DICTIONARY
> 
> 
> page 0:  DLE:BIT_PACKED RLE:BIT_PACKED 
> VLE:PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0] SZ:33291 VC:262146
> page 1:  DLE:BIT_PACKED RLE:BIT_PACKED 
> VLE:PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0] SZ:33291 VC:262145
> {code}
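The auto-crop behavior described above can be sketched in a few lines of
Python. This is a hypothetical re-creation of the idea (truncate each line to
the console width and append a marker unless a disable flag is set), not the
actual parquet-tools implementation:

```python
MARKER = " [more]..."

def crop_line(line: str, width: int = 80, disable_crop: bool = False) -> str:
    """Truncate a line to the given console width and append a marker,
    unless cropping is disabled. Illustrative only; the real dump
    command's logic may differ."""
    if disable_crop or len(line) <= width:
        return line
    # Reserve room for the marker so the output stays within `width`.
    return line[: width - len(MARKER)] + MARKER

row = ("id: INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100 "
       "ENC:BIT_PACKED,PLAIN_DICTIONARY")
print(crop_line(row, width=60))                      # cropped with " [more]..."
print(crop_line(row, width=60, disable_crop=True))   # full line
```

With the flag set, the full line is emitted regardless of console width, which
is what makes piped output usable.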





[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated

2016-08-03 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406334#comment-15406334
 ] 

Ryan Blue commented on PARQUET-323:
---

+1

> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
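The proposed replacement can be sketched concretely. Assuming Impala's INT96
layout (8-byte little-endian nanoseconds-of-day followed by a 4-byte
little-endian Julian day), a value rewrites mechanically to an INT64
(TIMESTAMP_MICROS) epoch timestamp. The helper below is an illustration, not
code from any Parquet implementation:

```python
import struct

JULIAN_EPOCH_DAY = 2_440_588  # Julian day number of 1970-01-01

def int96_to_micros(raw: bytes) -> int:
    """Decode a 12-byte INT96 timestamp into an INT64 microsecond
    epoch timestamp, i.e. the INT64 (TIMESTAMP_MICROS) value suggested
    as a replacement above."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_EPOCH_DAY
    return days_since_epoch * 86_400_000_000 + nanos_of_day // 1_000

# 1970-01-02 00:00:00.0000015 UTC encoded as INT96:
raw = struct.pack("<qi", 1_500, JULIAN_EPOCH_DAY + 1)
print(int96_to_micros(raw))  # 86400000001
```

The nanosecond remainder (here 500 ns) is discarded, which is the precision
trade-off the description calls rarely a real requirement.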





[jira] [Issue Comment Deleted] (PARQUET-323) INT96 should be marked as deprecated

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-323:
--
Comment: was deleted

(was: I think we should deprecate it and discourage its use. For backward 
compatibility, it has to stay.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md doesn't 
even refer to it.
)

> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.





[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated

2016-08-03 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406330#comment-15406330
 ] 

Julien Le Dem commented on PARQUET-323:
---

I think we should deprecate it and discourage its use. For backward 
compatibility, it has to stay.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md doesn't 
even refer to it.


> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.





Re: Parquet for High Energy Physics

2016-08-03 Thread Wes McKinney
Hi Jim,

Cool to hear about this use case. My gut feeling is that we should not
expand the scope of the parquet-cpp library itself too much beyond the
computational details of constructing the encoded streams / metadata
and writing to a file stream or decoding a file into the raw values
stored in each column.

We could potentially create adapter code to convert between Parquet
raw (arrays of data page values, repetition, and definition levels)
and Avro/Protobuf data structures.
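The "Parquet raw" representation mentioned here (arrays of data page values
plus repetition and definition levels) can be illustrated with a toy
Dremel-style encoder. The schema and function below are hypothetical sketches,
showing only how the three arrays relate for an optional list of required
ints, e.g. `optional group (LIST) { repeated int32 }`:

```python
def encode_list_column(rows):
    """Toy Dremel-style encoding for an optional list of required int
    values. Definition levels: 0 = null list, 1 = empty list,
    2 = element present. Repetition level 1 marks a continuation of
    the current list; 0 starts a new record."""
    values, def_levels, rep_levels = [], [], []
    for row in rows:
        if row is None:            # list itself is null
            def_levels.append(0)
            rep_levels.append(0)
        elif not row:              # empty but non-null list
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, v in enumerate(row):
                values.append(v)
                def_levels.append(2)
                rep_levels.append(0 if i == 0 else 1)
    return values, def_levels, rep_levels

print(encode_list_column([[1, 2], None, []]))
# ([1, 2], [2, 2, 0, 1], [0, 1, 0, 0])
```

An adapter layer of the kind described would translate between arrays like
these and Avro/Protobuf records, leaving the page encoding to parquet-cpp.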

What we've done in Arrow, since we will need a generic IO subsystem
for many tasks (for interacting with HDFS or other blob stores), is
put all of this in leaf libraries in apache/arrow (see arrow::io and
arrow::parquet namespaces). There isn't really the equivalent of a
Boost for C++ Apache projects, so arrow::io seemed like a fine place
to put them.

I'm getting back to SF from an international trip on the 16th but I
can meet with you in the later part of the day, and anyone else is
welcome to join to discuss.

- Wes

On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem  wrote:
> Yes that would be another way to do it.
> The Parquet-cpp/parquet-arrow integration/arrow cpp efforts are closely 
> related.
> Julien
>
>> On Aug 3, 2016, at 9:41 AM, Jim Pivarski  wrote:
>>
>> Related question: could I get ROOT's complex events into Parquet files
>> without inventing a Logical Type Definition by converting them to Apache
>> Arrow data structures in memory, and then letting the Arrow-Parquet
>> integration write those data structures to files?
>>
>> Arrow could provide side-benefits, such as sharing data between ROOT's C++
>> framework and JVM-based applications without intermediate files through the
>> JNI. (Two birds with one stone.)
>>
>> -- Jim
>


Re: Parquet for High Energy Physics

2016-08-03 Thread Julien Le Dem
I’m CC’ing the parquet-cpp main contributors. You are right that there has been 
a lot of progress recently.
(some of the goals are to provide Python and Vertica access to Parquet files)
I’ll let them comment on the progress.
You’re certainly welcome to contribute.
Julien

> On Aug 3, 2016, at 8:00 AM, Jim Pivarski  wrote:
> 
> Hi,
> 
> I'd like to use parquet-cpp for High Energy Physics (HEP) and possibly
> contribute to the core to support that use-case, but I'm having trouble
> determining the status of the C++ project.
> 
> Most HEP data is stored in the ROOT file format (
> https://root.cern.ch/root/InputOutput.html), which represents complex,
> nested, cross-referenced C++ objects with a columnar layout so that a
> subset of fields can be individually read, individually compressed, and
> quickly scanned. I believe that these benefits can be satisfied by Parquet,
> with the additional benefit that it's a standard with a specification that
> can be read or written in multiple languages. (Parquet can't be used as a
> random-writable object database, but this feature of ROOT isn't widely
> used.)
> 
> To convert between ROOT and Parquet, I would need to translate ROOT's
> "StreamerInfo" object schema (https://root.cern.ch/root/SchemaEvolution.html)
> into a Logical Type Definition, on par with AvroRecordReader, but also
> supporting pointer references (as an Int64 -> object map).
> 
> Parquet C++'s TODO (https://github.com/apache/parquet-cpp/blob/master/TODO)
> states that this record abstraction, as well as nested schemas and
> file-writing, haven't been implemented. However, the TODO is also 2 years
> old, whereas I see a burst of activity this year on GitHub. Is the TODO out
> of date?
> 
> Will any of the core developers be at KDD16 (http://www.kdd.org/kdd2016/)
> or elsewhere in San Francisco on August 15 or 16? If so, could we meet in
> person so that we can talk in detail about where the hooks I'm looking for
> are and how I can contribute? (Or *when* I should contribute, if there's a
> major refactoring in the works.)
> 
> Thanks!
> -- Jim



Re: Parquet for High Energy Physics

2016-08-03 Thread Jim Pivarski
Related question: could I get ROOT's complex events into Parquet files
without inventing a Logical Type Definition by converting them to Apache
Arrow data structures in memory, and then letting the Arrow-Parquet
integration write those data structures to files?

Arrow could provide side-benefits, such as sharing data between ROOT's C++
framework and JVM-based applications without intermediate files through the
JNI. (Two birds with one stone.)

-- Jim


Parquet for High Energy Physics

2016-08-03 Thread Jim Pivarski
Hi,

I'd like to use parquet-cpp for High Energy Physics (HEP) and possibly
contribute to the core to support that use-case, but I'm having trouble
determining the status of the C++ project.

Most HEP data is stored in the ROOT file format (
https://root.cern.ch/root/InputOutput.html), which represents complex,
nested, cross-referenced C++ objects with a columnar layout so that a
subset of fields can be individually read, individually compressed, and
quickly scanned. I believe that these benefits can be satisfied by Parquet,
with the additional benefit that it's a standard with a specification that
can be read or written in multiple languages. (Parquet can't be used as a
random-writable object database, but this feature of ROOT isn't widely
used.)

To convert between ROOT and Parquet, I would need to translate ROOT's
"StreamerInfo" object schema (https://root.cern.ch/root/SchemaEvolution.html)
into a Logical Type Definition, on par with AvroRecordReader, but also
supporting pointer references (as an Int64 -> object map).
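The Int64 -> object map idea can be sketched as follows. All names here are
hypothetical; this shows only the reference-numbering step (assigning stable
integer ids to shared objects), not ROOT deserialization or Parquet I/O:

```python
def assign_ref_ids(objects):
    """Map shared object references to stable int64 ids, so that a
    columnar format can store a cross-reference as an integer key into
    a side table rather than as a pointer."""
    ids = {}      # id(obj) -> assigned int64 key
    table = {}    # assigned key -> the referenced object
    out = []
    for obj in objects:
        key = ids.setdefault(id(obj), len(ids))
        table[key] = obj
        out.append(key)
    return out, table

shared = {"detector": "A"}
keys, table = assign_ref_ids([shared, {"detector": "B"}, shared])
print(keys)  # [0, 1, 0] -- the shared object maps to the same id both times
```

Reading back would resolve each int64 key through the stored table, restoring
the cross-references.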

Parquet C++'s TODO (https://github.com/apache/parquet-cpp/blob/master/TODO)
states that this record abstraction, as well as nested schemas and
file-writing, haven't been implemented. However, the TODO is also 2 years
old, whereas I see a burst of activity this year on GitHub. Is the TODO out
of date?

Will any of the core developers be at KDD16 (http://www.kdd.org/kdd2016/)
or elsewhere in San Francisco on August 15 or 16? If so, could we meet in
person so that we can talk in detail about where the hooks I'm looking for
are and how I can contribute? (Or *when* I should contribute, if there's a
major refactoring in the works.)

Thanks!
-- Jim