[jira] [Resolved] (PARQUET-672) [C++] Build testing conda artifacts in debug mode
[ https://issues.apache.org/jira/browse/PARQUET-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-672.
----------------------------------
       Resolution: Fixed
    Fix Version/s: cpp-0.1

Issue resolved by pull request 141
[https://github.com/apache/parquet-cpp/pull/141]

> [C++] Build testing conda artifacts in debug mode
> -------------------------------------------------
>
>                 Key: PARQUET-672
>                 URL: https://issues.apache.org/jira/browse/PARQUET-672
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>             Fix For: cpp-0.1
>
> The intent of this is to help us fix the broken build being discussed in ARROW-247
> https://github.com/apache/arrow/pull/111

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (PARQUET-672) [C++] Build testing conda artifacts in debug mode
Wes McKinney created PARQUET-672:
------------------------------------

             Summary: [C++] Build testing conda artifacts in debug mode
                 Key: PARQUET-672
                 URL: https://issues.apache.org/jira/browse/PARQUET-672
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
            Reporter: Wes McKinney

The intent of this is to help us fix the broken build being discussed in ARROW-247
https://github.com/apache/arrow/pull/111
[jira] [Resolved] (PARQUET-669) Allow reading file footers from input streams when writing metadata files
[ https://issues.apache.org/jira/browse/PARQUET-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-669.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.0

Issue resolved by pull request 357
https://github.com/apache/parquet-mr/pull/357

> Allow reading file footers from input streams when writing metadata files
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-669
>                 URL: https://issues.apache.org/jira/browse/PARQUET-669
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Robert Kruszewski
>            Assignee: Robert Kruszewski
>             Fix For: 1.9.0
[jira] [Updated] (PARQUET-669) Allow reading file footers from input streams when writing metadata files
[ https://issues.apache.org/jira/browse/PARQUET-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PARQUET-669:
----------------------------------
    Assignee: Robert Kruszewski

> Allow reading file footers from input streams when writing metadata files
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-669
>                 URL: https://issues.apache.org/jira/browse/PARQUET-669
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Robert Kruszewski
>            Assignee: Robert Kruszewski
[jira] [Resolved] (PARQUET-668) Provide option to disable auto crop feature in DumpCommand output
[ https://issues.apache.org/jira/browse/PARQUET-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-668.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.0

Issue resolved by pull request 358
[https://github.com/apache/parquet-mr/pull/358]

> Provide option to disable auto crop feature in DumpCommand output
> -----------------------------------------------------------------
>
>                 Key: PARQUET-668
>                 URL: https://issues.apache.org/jira/browse/PARQUET-668
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Daniel Harper
>            Priority: Trivial
>             Fix For: 1.9.0
>
> *Problem*
> When using the {{dump}} command in {{parquet-tools}}, the output will
> sometimes be truncated based on the width of your console, especially on
> smaller displays.
> Example:
> {code}
> row group 0
> --------------------------------------------------------------------------------
> id:         INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100 [more]...
> name:       BINARY SNAPPY DO:0 FPO:44672 SZ:89464018/1031768430/11.53 [more]...
> event_time: INT64 SNAPPY DO:0 FPO:89508690 SZ:43600235/57923935/1.33 VC:7240100 [more]...
>
> id TV=7240100 RL=0 DL=0 DS: 2 DE:PLAIN_DICTIONARY
> --------------------------------------------------------------------------------
> page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLA [more]... SZ:33291
> {code}
> This is especially annoying if you pipe the output to a file, as the
> truncation remains in place.
> *Proposed fix*
> Provide the flag {{--disable-crop}} for the dump command. Truncation is
> enabled by default and will only be disabled when this flag is provided.
> This will output the full content to standard out, for example:
> {code}
> row group 0
> --------------------------------------------------------------------------------
> id:         INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100 ENC:BIT_PACKED,PLAIN_DICTIONARY
> name:       BINARY SNAPPY DO:0 FPO:44672 SZ:89464018/1031768430/11.53 VC:7240100 ENC:PLAIN,BIT_PACKED
> event_time: INT64 SNAPPY DO:0 FPO:89508690 SZ:43600235/57923935/1.33 VC:7240100 ENC:PLAIN,BIT_PACKED,RLE
>
> id TV=7240100 RL=0 DL=0 DS: 2 DE:PLAIN_DICTIONARY
> --------------------------------------------------------------------------------
> page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0] SZ:33291 VC:262146
> page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0] SZ:33291 VC:262145
> {code}
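The cropping behavior under discussion can be illustrated with a minimal sketch. This is illustrative Python only, not parquet-tools' actual (Java) implementation; the function name and the exact marker string are assumptions based on the {{[more]...}} marker in the example output above.

```python
def crop(line: str, width: int, crop_enabled: bool = True) -> str:
    """Truncate a dump line to the console width, appending a
    '[more]...' marker. With cropping disabled (the behavior the
    proposed --disable-crop flag requests), the line passes
    through untouched regardless of width."""
    suffix = " [more]..."
    if not crop_enabled or len(line) <= width:
        return line
    # Reserve room for the marker so the total stays within width.
    return line[: max(0, width - len(suffix))] + suffix
```

Piping to a file inherits whatever the function returns, which is why disabling cropping (rather than widening the console) is the robust fix.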
[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated
[ https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406334#comment-15406334 ]

Ryan Blue commented on PARQUET-323:
-----------------------------------

+1

> INT96 should be marked as deprecated
> ------------------------------------
>
>                 Key: PARQUET-323
>                 URL: https://issues.apache.org/jira/browse/PARQUET-323
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec
> timestamp in Impala for some historical reasons, and should be deprecated.
> Since nanosec precision is rarely a real requirement, one possible and simple
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or
> {{INT64 (TIMESTAMP_MICROS)}}.
[jira] [Issue Comment Deleted] (PARQUET-323) INT96 should be marked as deprecated
[ https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PARQUET-323:
----------------------------------
    Comment: was deleted

(was: I think we should deprecate it and discourage its use. For backward
compatibility, it has to stay.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md doesn't
even refer to it.)

> INT96 should be marked as deprecated
> ------------------------------------
>
>                 Key: PARQUET-323
>                 URL: https://issues.apache.org/jira/browse/PARQUET-323
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec
> timestamp in Impala for some historical reasons, and should be deprecated.
> Since nanosec precision is rarely a real requirement, one possible and simple
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or
> {{INT64 (TIMESTAMP_MICROS)}}.
[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated
[ https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406330#comment-15406330 ]

Julien Le Dem commented on PARQUET-323:
---------------------------------------

I think we should deprecate it and discourage its use. For backward
compatibility, it has to stay.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md doesn't
even refer to it.

> INT96 should be marked as deprecated
> ------------------------------------
>
>                 Key: PARQUET-323
>                 URL: https://issues.apache.org/jira/browse/PARQUET-323
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec
> timestamp in Impala for some historical reasons, and should be deprecated.
> Since nanosec precision is rarely a real requirement, one possible and simple
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or
> {{INT64 (TIMESTAMP_MICROS)}}.
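For context on the proposed replacement: Impala's INT96 convention packs an 8-byte nanoseconds-since-midnight count followed by a 4-byte Julian day number, while TIMESTAMP_MICROS is a plain INT64 of microseconds since the Unix epoch. A sketch of the conversion in Python (the function name is made up here; the byte layout described is the widely used Impala convention, not something defined in this issue):

```python
import struct

JULIAN_UNIX_EPOCH = 2440588        # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 10**9

def int96_to_timestamp_micros(raw: bytes) -> int:
    """Decode a 12-byte little-endian INT96 timestamp
    (8 bytes nanoseconds-of-day, then 4 bytes Julian day) into an
    INT64 microseconds-since-epoch value (TIMESTAMP_MICROS)."""
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days_since_epoch = julian_day - JULIAN_UNIX_EPOCH
    total_nanos = days_since_epoch * NANOS_PER_DAY + nanos_of_day
    return total_nanos // 1_000    # nanoseconds -> microseconds
```

The integer division is where the sub-microsecond precision is dropped, which is exactly the trade-off the issue calls rarely a real requirement.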
Re: Parquet for High Energy Physics
Hi Jim,

Cool to hear about this use case. My gut feeling is that we should not expand the scope of the parquet-cpp library itself too much beyond the computational details of constructing the encoded streams / metadata and writing to a file stream, or decoding a file into the raw values stored in each column.

We could potentially create adapter code to convert between Parquet raw data (arrays of data page values, repetition levels, and definition levels) and Avro/Protobuf data structures.

What we've done in Arrow, since we will need a generic IO subsystem for many tasks (for interacting with HDFS or other blob stores), is put all of this in leaf libraries in apache/arrow (see the arrow::io and arrow::parquet namespaces). There isn't really the equivalent of a Boost for C++ Apache projects, so arrow::io seemed like a fine place to put them.

I'm getting back to SF from an international trip on the 16th, but I can meet with you in the later part of the day, and anyone else is welcome to join to discuss.

- Wes

On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem wrote:
> Yes that would be another way to do it.
> The parquet-cpp / parquet-arrow integration / arrow cpp efforts are closely related.
> Julien
>
>> On Aug 3, 2016, at 9:41 AM, Jim Pivarski wrote:
>>
>> Related question: could I get ROOT's complex events into Parquet files
>> without inventing a Logical Type Definition by converting them to Apache
>> Arrow data structures in memory, and then letting the Arrow-Parquet
>> integration write those data structures to files?
>>
>> Arrow could provide side-benefits, such as sharing data between ROOT's C++
>> framework and JVM-based applications without intermediate files through the
>> JNI. (Two birds with one stone.)
>>
>> -- Jim
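The "Parquet raw" representation mentioned above (page values plus repetition and definition levels) is the Dremel encoding, so the adapter work would essentially be record assembly. As an illustration only (this is not parquet-cpp API, and the function name is invented), here is how a single repeated leaf column reassembles into one list per record:

```python
def assemble_repeated_column(values, rep_levels, def_levels, max_def=1):
    """Reassemble a flat Dremel-encoded column (one repeated leaf
    field) into a list per record. A repetition level of 0 marks
    the start of a new record; a definition level below max_def
    means the entry is absent, so no value is consumed."""
    records = []
    value_index = 0
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:                      # new record begins
            records.append([])
        if definition == max_def:         # a concrete value exists
            records[-1].append(values[value_index])
            value_index += 1
    return records
```

An adapter to Avro or Protobuf structures would walk the full schema tree this way, with one level pair per leaf column; nested groups raise max_def and the repetition levels distinguish which list level restarts.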
Re: Parquet for High Energy Physics
I’m CC’ing the parquet-cpp main contributors. You are right that there has been a lot of progress recently. (Some of the goals are to provide Python and Vertica access to Parquet files.) I’ll let them comment on the progress. You’re certainly welcome to contribute.

Julien

> On Aug 3, 2016, at 8:00 AM, Jim Pivarski wrote:
>
> Hi,
>
> I'd like to use parquet-cpp for High Energy Physics (HEP) and possibly
> contribute to the core to support that use-case, but I'm having trouble
> determining the status of the C++ project.
>
> Most HEP data is stored in the ROOT file format
> (https://root.cern.ch/root/InputOutput.html), which represents complex,
> nested, cross-referenced C++ objects with a columnar layout so that a
> subset of fields can be individually read, individually compressed, and
> quickly scanned. I believe that these benefits can be satisfied by Parquet,
> with the additional benefit that it's a standard with a specification that
> can be read or written in multiple languages. (Parquet can't be used as a
> random-writable object database, but this feature of ROOT isn't widely
> used.)
>
> To convert between ROOT and Parquet, I would need to implement ROOT's
> "StreamerInfo" object schema (https://root.cern.ch/root/SchemaEvolution.html)
> as a Logical Type Definition, on par with AvroRecordReader, but also
> supporting pointer references (as an Int64 -> object map).
>
> Parquet C++'s TODO (https://github.com/apache/parquet-cpp/blob/master/TODO)
> states that this record abstraction, as well as nested schemas and
> file-writing, haven't been implemented. However, the TODO is also 2 years
> old, whereas I see a burst of activity this year on GitHub. Is the TODO out
> of date?
>
> Will any of the core developers be at KDD16 (http://www.kdd.org/kdd2016/)
> or elsewhere in San Francisco on August 15 or 16? If so, could we meet in
> person so that we can talk in detail about where the hooks I'm looking for
> are and how I can contribute? (Or *when* I should contribute, if there's a
> major refactoring in the works.)
>
> Thanks!
> -- Jim
Re: Parquet for High Energy Physics
Related question: could I get ROOT's complex events into Parquet files without inventing a Logical Type Definition by converting them to Apache Arrow data structures in memory, and then letting the Arrow-Parquet integration write those data structures to files?

Arrow could provide side-benefits, such as sharing data between ROOT's C++ framework and JVM-based applications without intermediate files through the JNI. (Two birds with one stone.)

-- Jim
Parquet for High Energy Physics
Hi,

I'd like to use parquet-cpp for High Energy Physics (HEP) and possibly contribute to the core to support that use-case, but I'm having trouble determining the status of the C++ project.

Most HEP data is stored in the ROOT file format (https://root.cern.ch/root/InputOutput.html), which represents complex, nested, cross-referenced C++ objects with a columnar layout so that a subset of fields can be individually read, individually compressed, and quickly scanned. I believe that these benefits can be satisfied by Parquet, with the additional benefit that it's a standard with a specification that can be read or written in multiple languages. (Parquet can't be used as a random-writable object database, but this feature of ROOT isn't widely used.)

To convert between ROOT and Parquet, I would need to implement ROOT's "StreamerInfo" object schema (https://root.cern.ch/root/SchemaEvolution.html) as a Logical Type Definition, on par with AvroRecordReader, but also supporting pointer references (as an Int64 -> object map).

Parquet C++'s TODO (https://github.com/apache/parquet-cpp/blob/master/TODO) states that this record abstraction, as well as nested schemas and file-writing, haven't been implemented. However, the TODO is also 2 years old, whereas I see a burst of activity this year on GitHub. Is the TODO out of date?

Will any of the core developers be at KDD16 (http://www.kdd.org/kdd2016/) or elsewhere in San Francisco on August 15 or 16? If so, could we meet in person so that we can talk in detail about where the hooks I'm looking for are and how I can contribute? (Or *when* I should contribute, if there's a major refactoring in the works.)

Thanks!
-- Jim