[jira] [Commented] (PARQUET-518) Review usages of size_t and unsigned integers generally per Google style guide

2016-02-26 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170246#comment-15170246
 ] 

Wes McKinney commented on PARQUET-518:
--

Patch available here: https://github.com/apache/parquet-cpp/pull/63

> Review usages of size_t and unsigned integers generally per Google style guide
> --
>
> Key: PARQUET-518
> URL: https://issues.apache.org/jira/browse/PARQUET-518
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Minor
>
> The Google style guide recommends generally avoiding unsigned integers 
> because of the bugs they can silently introduce. 
> https://google.github.io/styleguide/cppguide.html#Integer_Types
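
For readers unfamiliar with the issue, here is a minimal sketch of two common ways unsigned arithmetic goes wrong silently. The function names are hypothetical and are not taken from parquet-cpp or from the patch; they only illustrate the kind of bug the style guide warns about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: values left to read, computed with unsigned types.
// If consumed > total, the subtraction wraps around to a huge positive
// number instead of signalling an error.
inline std::size_t RemainingUnsigned(std::size_t total, std::size_t consumed) {
  return total - consumed;
}

// Signed variant: over-consumption shows up as a negative result that the
// caller can detect.
inline int64_t RemainingSigned(int64_t total, int64_t consumed) {
  return total - consumed;
}

// Reverse iteration with a signed index. The unsigned form
//   for (size_t i = values.size() - 1; i >= 0; --i)
// never terminates, because "i >= 0" is always true for an unsigned i.
inline int64_t SumReversed(const std::vector<int32_t>& values) {
  int64_t sum = 0;
  for (int64_t i = static_cast<int64_t>(values.size()) - 1; i >= 0; --i) {
    sum += values[static_cast<std::size_t>(i)];
  }
  return sum;
}
```

With signed types, RemainingSigned(3, 5) yields -2, whereas RemainingUnsigned(3, 5) yields a very large positive value that looks like a valid count.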



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-539) Enable include_order cpplint check

2016-02-26 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-539.
--
Resolution: Fixed
  Assignee: Wes McKinney

Done in 
https://github.com/apache/parquet-cpp/commit/c6e069297a3b8d0f9ad45da04fe114d40c593115

> Enable include_order cpplint check
> --
>
> Key: PARQUET-539
> URL: https://issues.apache.org/jira/browse/PARQUET-539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> This will help keep our includes organized.





[jira] [Resolved] (PARQUET-493) Adapt DictEncoder from Impala (or implement a new one) and unit test

2016-02-26 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-493.
--
Resolution: Fixed

Completed in 
https://github.com/apache/parquet-cpp/commit/c6e069297a3b8d0f9ad45da04fe114d40c593115

> Adapt DictEncoder from Impala (or implement a new one) and unit test
> 
>
> Key: PARQUET-493
> URL: https://issues.apache.org/jira/browse/PARQUET-493
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> Only decoding is available in the library at the moment. Without an 
> encoder, we cannot generate dictionary-encoded data pages for unit testing 
> purposes. 
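
As a rough illustration of what a dictionary encoder does, the sketch below maps each distinct value to an integer index and records the index stream. SimpleDictEncoder and DecodeDictionary are hypothetical names invented for this example; this is not the Impala DictEncoder or the parquet-cpp implementation, and it leaves out the serialization of the indices into a data page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration type, not a parquet-cpp class.
template <typename T>
class SimpleDictEncoder {
 public:
  // Records one value and returns its dictionary index. A value seen for
  // the first time is appended to the dictionary.
  int32_t Put(const T& value) {
    int32_t index;
    auto it = index_of_.find(value);
    if (it == index_of_.end()) {
      index = static_cast<int32_t>(dictionary_.size());
      dictionary_.push_back(value);
      index_of_.emplace(value, index);
    } else {
      index = it->second;
    }
    indices_.push_back(index);
    return index;
  }

  // Distinct values in first-seen order (the content of a dictionary page).
  const std::vector<T>& dictionary() const { return dictionary_; }
  // One index per value passed to Put (the content of the data pages).
  const std::vector<int32_t>& indices() const { return indices_; }

 private:
  std::unordered_map<T, int32_t> index_of_;
  std::vector<T> dictionary_;
  std::vector<int32_t> indices_;
};

// Inverse operation: look each index up in the dictionary.
template <typename T>
std::vector<T> DecodeDictionary(const std::vector<T>& dictionary,
                                const std::vector<int32_t>& indices) {
  std::vector<T> values;
  values.reserve(indices.size());
  for (int32_t index : indices) {
    assert(index >= 0 && static_cast<std::size_t>(index) < dictionary.size());
    values.push_back(dictionary[static_cast<std::size_t>(index)]);
  }
  return values;
}
```

A round trip through Put and DecodeDictionary is the kind of check a unit test for the decoder needs, which is why the encoder is a prerequisite for testing.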





Re: parquet-cpp first 0.1 release planning & timeline

2016-02-26 Thread Julien Le Dem
Merged.
Thank you.

On Fri, Feb 26, 2016 at 8:33 AM, Wes McKinney wrote:

> If someone could kindly merge this patch (PARQUET-494):
>
> https://github.com/apache/parquet-cpp/pull/64
>
> we'll then be able to close out the remaining JIRAs and hopefully tag
> a release candidate on Monday or Tuesday. Since we may find bugs after
> the RC, we may need to delay the actual release until end of the week
> / beginning of the following week.
>
> Let me know what process we should follow for having a release vote.
>
> Thanks!
> Wes
>
> On Wed, Feb 24, 2016 at 9:34 AM, Majeti, Deepak wrote:
> > Sounds good to me. I am opening a JIRA to support Decimal Values. It
> > should be a straightforward extension to FLBA with the additional
> > requirement of swapping the bytes.
> >
> > On 02/23/2016 11:35 AM, Wes McKinney wrote:
> > It looks like we should be able to clear our current patch queue today
> > which puts us in very good shape for the 0.1 release.
> >
> > I am planning to complete the dictionary encoding patches
> > (PARQUET-493/494) by end of day tomorrow, so when those are done we should
> > go into code scrubbing / bug fix mode unless there are any other major
> > functional requirements that we are missing?
> >
> > - Wes
>



-- 
Julien


[jira] [Resolved] (PARQUET-494) Implement PLAIN_DICTIONARY encoding and decoding

2016-02-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-494.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 64
[https://github.com/apache/parquet-cpp/pull/64]

> Implement PLAIN_DICTIONARY encoding and decoding
> 
>
> Key: PARQUET-494
> URL: https://issues.apache.org/jira/browse/PARQUET-494
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: cpp-0.1
>
>
> parquet-cpp currently only supports {{Encoding::RLE_DICTIONARY}}. Some 
> implementations of Parquet still use {{Encoding::PLAIN_DICTIONARY}} (the 
> dictionary indices are not RLE-encoded). 





[jira] [Created] (PARQUET-548) Add Java metadata for PageEncodingStats

2016-02-26 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-548:
-

 Summary: Add Java metadata for PageEncodingStats
 Key: PARQUET-548
 URL: https://issues.apache.org/jira/browse/PARQUET-548
 Project: Parquet
 Issue Type: New Feature
 Components: parquet-mr
 Reporter: Ryan Blue
 Assignee: Ryan Blue


PARQUET-384 needs to determine whether an entire column chunk is 
dictionary-encoded, but that case is difficult to detect from the set of 
encodings for a column. For 1.0, this can be done by checking for a PLAIN page, 
because both dictionary pages and dictionary-encoded pages use PLAIN_DICTIONARY, 
and RLE/BIT_PACKING is only used for repetition and definition levels. But for 
2.0, dictionary pages might be using PLAIN, so there is no way to tell whether 
a column has fallen back to a non-dictionary encoding.

[PageEncodingStats|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L449]
 were added to the format to solve this problem, so we just need to implement 
them.
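
A minimal sketch of how such stats could answer the question, written in C++ for illustration even though this issue targets parquet-mr. The struct mirrors the three fields of the Thrift PageEncodingStats definition (page_type, encoding, count); the enums are simplified stand-ins, and IsFullyDictionaryEncoded is a hypothetical name, not an existing API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for the Thrift enums; only the names are borrowed.
enum class PageType { DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 };
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED, RLE_DICTIONARY };

// Number of pages of a given type that use a given encoding.
struct PageEncodingStats {
  PageType page_type;
  Encoding encoding;
  int32_t count;
};

// Hypothetical check: true when the chunk has at least one data page and
// every data page uses a dictionary encoding. The dictionary page itself is
// ignored, so it no longer matters whether it was written as PLAIN.
inline bool IsFullyDictionaryEncoded(const std::vector<PageEncodingStats>& stats) {
  bool saw_data_page = false;
  for (const PageEncodingStats& s : stats) {
    if (s.count <= 0) continue;
    const bool is_data_page = s.page_type == PageType::DATA_PAGE ||
                              s.page_type == PageType::DATA_PAGE_V2;
    if (!is_data_page) continue;
    saw_data_page = true;
    const bool is_dictionary = s.encoding == Encoding::PLAIN_DICTIONARY ||
                               s.encoding == Encoding::RLE_DICTIONARY;
    if (!is_dictionary) return false;  // at least one page fell back
  }
  return saw_data_page;
}
```

Because the stats are keyed by page type as well as encoding, a PLAIN dictionary page and a PLAIN fallback data page are no longer indistinguishable.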





Re: Question about ParquetInputSplitWrapper

2016-02-26 Thread Ryan Blue
Hi TianYi,

It looks like the DeprecatedParquetInputFormat also accepts mapred
FileSplits [1], so if you created a simple wrapper that delegated to
FileInputFormat for split planning and DeprecatedParquetInputFormat for
everything else, you could use CombineFileInputFormat with that. We kept
the wrapper when we moved to planning with file splits so that anyone
casting the splits they get back to the wrapper class or ParquetInputSplit
wouldn't break.

rb


[1]:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/mapred/DeprecatedParquetInputFormat.java#L95

On Thu, Feb 25, 2016 at 7:57 PM, Zhu, Tianyi wrote:

> Hi all,
>
> I'm trying to read lots of small Parquet files using Scalding (Cascading),
> and I was wondering whether Parquet is compatible with
> CombineFileInputFormat, so that I can read more files in one mapper. However,
> Cascading is still using the following class:
>
>
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/mapred/DeprecatedParquetInputFormat.java#L176
>
> This class implements the InputSplit interface (incompatible with
> CombineFileInputFormat) rather than extending the FileSplit class
> (compatible with CombineFileInputFormat).
>
> Would it be possible to remove this private class and use
> ParquetInputSplit instead?
>
> Thanks.
>
> --
> Ciao,
> TianYi ZHU
>


-- 
Ryan Blue
Software Engineer
Netflix


[jira] [Updated] (PARQUET-545) Improve API to support Decimal type

2016-02-26 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-545:
--
Summary: Improve API to support Decimal type  (was: Support Decimal values)

> Improve API to support Decimal type
> ---
>
> Key: PARQUET-545
> URL: https://issues.apache.org/jira/browse/PARQUET-545
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Deepak Majeti
>
> Extend the `ColumnDescriptor` API to return `precision` and `scale` values 
> from DecimalMetadata. Implement necessary checks if the `LogicalType` is 
> Decimal.





[jira] [Updated] (PARQUET-545) Support Decimal values

2016-02-26 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-545:
--
Description: Extend the `ColumnDescriptor` API to return `precision` and 
`scale` values from DecimalMetadata. Implement necessary checks if the 
`LogicalType` is Decimal.

> Support Decimal values
> --
>
> Key: PARQUET-545
> URL: https://issues.apache.org/jira/browse/PARQUET-545
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Deepak Majeti
>
> Extend the `ColumnDescriptor` API to return `precision` and `scale` values 
> from DecimalMetadata. Implement necessary checks if the `LogicalType` is 
> Decimal.





[jira] [Commented] (PARQUET-545) Support Decimal values

2016-02-26 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169295#comment-15169295
 ] 

Deepak Majeti commented on PARQUET-545:
---

As far as I understand, Impala treats `Decimal` and `Timestamp` values 
similarly to Parquet primitive types and supports them at the encode/decode 
layer. `parquet-mr`, on the other hand, leaves it to its clients (say, `hive`) 
to interpret them from the Parquet primitive types (e.g. 
`FIXED_LEN_BYTE_ARRAY`). However, `parquet-mr` does some basic checks on the 
`length`, `precision`, and `scale` values specified in the schema. I will 
update the scope of this JIRA.
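
To make "basic checks" concrete, here is a hedged sketch of what a schema-level validation for a FIXED_LEN_BYTE_ARRAY decimal might look like. It is based on the rules in the Parquet format description of DECIMAL (precision must be positive, scale must lie between zero and precision, and an n-byte array can hold at most floor(log10(2^(8n-1) - 1)) digits), not on the actual parquet-mr code. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Largest decimal precision that fits in a two's complement integer stored
// in type_length bytes. floor(log10(2^k - 1)) equals floor(k * log10(2)) for
// k >= 1, because 2^k is never a power of ten; the second form avoids
// computing 2^k directly.
inline int32_t MaxDecimalPrecision(int32_t type_length) {
  assert(type_length > 0);
  const int32_t magnitude_bits = 8 * type_length - 1;  // one bit for the sign
  return static_cast<int32_t>(std::floor(magnitude_bits * std::log10(2.0)));
}

// Hypothetical schema check for a FIXED_LEN_BYTE_ARRAY decimal column.
inline bool IsValidDecimal(int32_t type_length, int32_t precision, int32_t scale) {
  if (type_length <= 0 || precision <= 0) return false;
  if (scale < 0 || scale > precision) return false;
  return precision <= MaxDecimalPrecision(type_length);
}
```

For example, a 16-byte array can hold up to 38 digits, so precision 38 with scale 10 passes, while precision 10 in a 4-byte array (maximum 9 digits) does not.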

> Support Decimal values
> --
>
> Key: PARQUET-545
> URL: https://issues.apache.org/jira/browse/PARQUET-545
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Deepak Majeti
>






Re: parquet-cpp first 0.1 release planning & timeline

2016-02-26 Thread Wes McKinney
If someone could kindly merge this patch (PARQUET-494):

https://github.com/apache/parquet-cpp/pull/64

we'll then be able to close out the remaining JIRAs and hopefully tag
a release candidate on Monday or Tuesday. Since we may find bugs after
the RC, we may need to delay the actual release until end of the week
/ beginning of the following week.

Let me know what process we should follow for having a release vote.

Thanks!
Wes

On Wed, Feb 24, 2016 at 9:34 AM, Majeti, Deepak wrote:
> Sounds good to me. I am opening a JIRA to support Decimal Values. It should 
> be a straightforward extension to FLBA with the additional requirement of 
> swapping the bytes.
>
> On 02/23/2016 11:35 AM, Wes McKinney wrote:
> It looks like we should be able to clear our current patch queue today which 
> puts us in very good shape for the 0.1 release.
>
> I am planning to complete the dictionary encoding patches (PARQUET-493/494) 
> by end of day tomorrow, so when those are done we should go into code 
> scrubbing / bug fix mode unless there are any other major functional 
> requirements that we are missing?
>
> - Wes
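
The byte swapping mentioned above for Decimal values stored as FLBA can be sketched as follows. Parquet stores the unscaled decimal value as a big-endian two's complement integer, so on a little-endian host the bytes have to be reassembled in the opposite order. DecodeBigEndianDecimal is a hypothetical name, and the sketch only covers values of up to 8 bytes; wider decimals would need a larger integer type.

```cpp
#include <cassert>
#include <cstdint>

// Reassembles a big-endian two's complement integer of 1 to 8 bytes into an
// int64_t holding the unscaled decimal value.
inline int64_t DecodeBigEndianDecimal(const uint8_t* bytes, int32_t length) {
  assert(bytes != nullptr && length > 0 && length <= 8);
  // Start from all ones when the sign bit is set, so that the bytes not
  // covered by the input are sign-extended.
  uint64_t value = (bytes[0] & 0x80) ? ~uint64_t{0} : uint64_t{0};
  for (int32_t i = 0; i < length; ++i) {
    value = (value << 8) | bytes[i];  // most significant byte comes first
  }
  // Conversion of an out-of-range unsigned value to signed is
  // implementation-defined before C++20, but yields the two's complement
  // result on the compilers commonly used today.
  return static_cast<int64_t>(value);
}
```

The caller still has to apply the column's scale to interpret the result; for instance, an unscaled value of 1234 with scale 2 represents 12.34.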