[jira] [Resolved] (PARQUET-1008) Update TypedColumnReader::ReadBatch method to accept batch_size as int64_t

2017-06-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1008.
---
   Resolution: Fixed
Fix Version/s: cpp-1.2.0

Issue resolved by pull request 349
[https://github.com/apache/parquet-cpp/pull/349]

> Update TypedColumnReader::ReadBatch method to accept batch_size as int64_t
> --
>
> Key: PARQUET-1008
> URL: https://issues.apache.org/jira/browse/PARQUET-1008
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Max Risuhin
>Assignee: Max Risuhin
> Fix For: cpp-1.2.0
>
>
> The TypedColumnReader::ReadBatch method should take its batch_size input
> parameter as int64_t instead of the currently used int.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-884) Add support for Decimal datatype to Parquet-Pig record reader

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-884:
--
Fix Version/s: (was: 1.9.0)
   1.10.0

> Add support for Decimal datatype to Parquet-Pig record reader
> -
>
> Key: PARQUET-884
> URL: https://issues.apache.org/jira/browse/PARQUET-884
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-pig
>Reporter: Ellen Kletscher
>Assignee: Ellen Kletscher
>Priority: Minor
> Fix For: 1.10.0
>
>
> parquet.pig.ParquetLoader defaults the Parquet decimal datatype to bytearray. 
> We would like to add support for converting to BigDecimal instead, which will 
> turn garbage bytearrays into actual numbers.
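
As background for the conversion, here is a minimal sketch (not parquet-pig's
actual code; the helper name is made up) of how a Parquet DECIMAL stored as a
byte array maps to a BigDecimal: the bytes hold the big-endian
two's-complement unscaled value, and the scale comes from the schema.

import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalFromBytes {
    // Hypothetical helper: Parquet DECIMAL stores the unscaled value as a
    // big-endian two's-complement integer; the scale is declared in the schema.
    static BigDecimal fromBytes(byte[] unscaled, int scale) {
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    public static void main(String[] args) {
        byte[] raw = {0x04, (byte) 0xD2};      // unscaled value 1234
        System.out.println(fromBytes(raw, 2)); // prints 12.34 with scale 2
    }
}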



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-392.
---
Resolution: Delivered

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041807#comment-16041807
 ] 

Julien Le Dem commented on PARQUET-392:
---

[~zi] done. Thanks for checking


> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041805#comment-16041805
 ] 

Julien Le Dem commented on PARQUET-392:
---

[~djiangxu] I have updated PARQUET-686: JIRAs should be closed automatically 
when we merge the PR. This one fell through: 
https://github.com/apache/parquet-mr/commit/de99127d77dabfc6c8134b3c58e0b9a0b74e5f37

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-686) Allow for Unsigned Statistics in Binary Type

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-686.
---
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 1.9.0

> Allow for Unsigned Statistics in Binary Type
> 
>
> Key: PARQUET-686
> URL: https://issues.apache.org/jira/browse/PARQUET-686
> Project: Parquet
>  Issue Type: Bug
>Reporter: Andrew Duffy
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> BinaryStatistics currently only have a min/max, which are compared as signed 
> {{byte[]}}. However, for proper UTF8-friendly lexicographic comparison, e.g. 
> for string columns, we would want to calculate the BinaryStatistics based on 
> a comparator that treats the bytes as unsigned.
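
To illustrate the distinction, a sketch assuming plain lexicographic
comparison (not necessarily the exact comparator parquet-mr adopted): Java's
byte is signed, so any byte with the high bit set, as in multi-byte UTF-8
sequences, sorts before ASCII unless each byte is masked to its unsigned
value.

import java.nio.charset.StandardCharsets;
import java.util.Comparator;

public class UnsignedByteArrays {
    // Lexicographic comparison treating each byte as unsigned (0..255).
    static final Comparator<byte[]> UNSIGNED = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF); // mask to unsigned
            if (cmp != 0) return cmp;
        }
        return a.length - b.length; // a strict prefix sorts first
    };

    public static void main(String[] args) {
        byte[] a = "A".getBytes(StandardCharsets.UTF_8); // {0x41}
        byte[] e = "é".getBytes(StandardCharsets.UTF_8); // {0xC3, 0xA9}
        // A signed comparison sees 0xC3 as -61 and sorts "é" before "A";
        // the unsigned comparator keeps the expected UTF-8 code point order.
        System.out.println(UNSIGNED.compare(a, e) < 0); // true
    }
}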



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-686) Allow for Unsigned Statistics in Binary Type

2017-06-07 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041802#comment-16041802
 ] 

Julien Le Dem commented on PARQUET-686:
---

The first issue (making sure we do not return bad stats) was solved in: 
https://github.com/apache/parquet-mr/commit/de99127d77dabfc6c8134b3c58e0b9a0b74e5f37

> Allow for Unsigned Statistics in Binary Type
> 
>
> Key: PARQUET-686
> URL: https://issues.apache.org/jira/browse/PARQUET-686
> Project: Parquet
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> BinaryStatistics currently only have a min/max, which are compared as signed 
> {{byte[]}}. However, for proper UTF8-friendly lexicographic comparison, e.g. 
> for string columns, we would want to calculate the BinaryStatistics based on 
> a comparator that treats the bytes as unsigned.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title

2017-06-07 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-1024:
--

 Summary: allow for case insensitive parquet-xxx prefix in PR title
 Key: PARQUET-1024
 URL: https://issues.apache.org/jira/browse/PARQUET-1024
 Project: Parquet
  Issue Type: Improvement
Reporter: Julien Le Dem
Assignee: Julien Le Dem






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-884) Add support for Decimal datatype to Parquet-Pig record reader

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-884.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 404
[https://github.com/apache/parquet-mr/pull/404]

> Add support for Decimal datatype to Parquet-Pig record reader
> -
>
> Key: PARQUET-884
> URL: https://issues.apache.org/jira/browse/PARQUET-884
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-pig
>Reporter: Ellen Kletscher
>Priority: Minor
> Fix For: 1.9.0
>
>
> parquet.pig.ParquetLoader defaults the Parquet decimal datatype to bytearray. 
> We would like to add support for converting to BigDecimal instead, which will 
> turn garbage bytearrays into actual numbers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Parquet sync starting in 10 min

2017-06-07 Thread Julien Le Dem
Notes:
Attendees/agenda building:
Zoltan (Cloudera):
 - timestamp, min/max
Anna (Cloudera)
Deepak (Vertica):
 - timestamp
 - c++/java: bloom filter.
Lars (Cloudera Impala)
 - page skipping indexes
 - open PRs
Pooja (Cloudera Impala):
 - page skipping indexes
Julien (Dremio):
 - page skipping indexes
 - timestamp


Agenda:
 - open PRs
  TODO (all): review:
   - https://github.com/apache/parquet-format/pull/54
   - https://github.com/apache/parquet-mr/pull/414
   - https://github.com/apache/parquet-mr/pull/411
   - https://github.com/apache/parquet-mr/pull/413
   - https://github.com/apache/parquet-mr/pull/410
  TODO:
   - follow up (Julien, Lars, Ryan): https://github.com/apache/parquet-format/pull/53
   - Ryan follow up: https://github.com/apache/parquet-format/pull/51
   - Julien more tests: https://github.com/apache/parquet-format/pull/50
   - Ryan follow up: https://github.com/apache/parquet-format/pull/49
 - PR triage:
   - TODO: Lars to do a pass on parquet-format
   - TODO: Julien to do a pass on parquet-mr
 - timestamps:
   - When reading from Parquet to Arrow: if the timestamp isAdjustedToUTC, in
Arrow we use the UTC timezone; otherwise no timezone (timestamp without
timezone).
   - follow up on jira about timestamp with timezone: PARQUET-906
 - min/max: PARQUET-686
   - final conclusion: https://github.com/apache/parquet-format/pull/46
   - PARQUET-839 => duplicate of PARQUET-686
   - TODO close obsolete PRs:
  - https://github.com/apache/parquet-format/pull/42
  - https://github.com/apache/parquet-mr/pull/362
   - We need an implementation in parquet-mr for the metadata in
 https://github.com/apache/parquet-format/pull/46
  - TODO: Zoltan to open a jira
  - Impala has an implementation; we should test that they are compatible
 - bloom filter
   - PARQUET-319: see linked PR and doc.
  - https://github.com/apache/parquet-format/pull/28
  - https://docs.google.com/document/d/1mIZ0W24Cr79QHJWN1sQ3dIUc4lAK5AVqozwSwtpFhW8/edit#heading=h.hmt1hrab3fpc
  - TODO: review and give feedback
 - page skipping indexes
   - plan is to prototype a writer in Impala, then a reader.
   - We'll review the results to finalize the metadata in 5-6 weeks.
 - dealing with statistics coming from parquet-cpp
   - new min/max_value fields will be the reference


On Wed, Jun 7, 2017 at 10:54 AM, Wes McKinney  wrote:

> Sorry, I was unable to join the sync today. I'm interested to discuss
> more my comments on
>
> https://github.com/apache/parquet-format/pull/51#discussion_r119911623
>
> I'll wait for the notes from the call and maybe we can continue the
> discussion on GitHub
>
> On Wed, Jun 7, 2017 at 12:53 PM, Julien Le Dem  wrote:
> > 10am PT on google hangout:
> > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >
> > Reminder that this is open to all.
> > Here is how it goes:
> > - we do a "round table" of the people present, where they quickly
> > introduce themselves and state the topics they wish to discuss (if any;
> > being a "fly on the wall" is totally fine too)
> > - based on that first round we summarize the agenda and go over the topics
> > one by one (this can be just bringing a PR that needs review to people's
> > attention, or asking whether it makes sense to implement some new feature)
> > - in the end we send notes back to the list, and follow-ups happen on
> > JIRA, GitHub PRs and the dev list
> > - if the time is inconvenient for you, say so on the list and we can
> > figure something out.
> >
> > --
> > Julien
>



-- 
Julien


[jira] [Assigned] (PARQUET-906) add logical type timestamp with timezone (per SQL)

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-906:
-

   Assignee: Julien Le Dem
Description: 
timestamp with timezone (per SQL)
timestamps are adjusted to UTC and stored as integers.
metadata in logical types PR:
See discussion here: 
https://github.com/apache/parquet-format/pull/51#discussion_r109667837



  was:
We need to clarify the spec here.
TODO: validate the following points.
timestamp with timezone (per SQL)
- each value has timezone
- TZ can be different for each value



> add logical type timestamp with timezone (per SQL)
> --
>
> Key: PARQUET-906
> URL: https://issues.apache.org/jira/browse/PARQUET-906
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Minor
>
> timestamp with timezone (per SQL)
> timestamps are adjusted to UTC and stored as integers.
> metadata in logical types PR:
> See discussion here: 
> https://github.com/apache/parquet-format/pull/51#discussion_r109667837



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: BYTE_ARRAY possible to get max possible size?

2017-06-07 Thread Felipe Aramburu
So I guess it's just calculating the distance between the offsets. For now
we might just make that part of our "catalogue" step. If I wanted to add it
to Statistics, is there somewhere you can point me to where that would be
added?
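
For concreteness, a minimal sketch of "calculating the distance between the
offsets", assuming an Arrow-style offsets buffer (length = value count + 1)
for a variable-length binary column; the helper name is hypothetical.

public class MaxBinaryLength {
    // With Arrow-style offsets, element i spans bytes
    // [offsets[i], offsets[i + 1]), so its length is the offset distance.
    static int maxElementLength(int[] offsets) {
        int max = 0;
        for (int i = 0; i + 1 < offsets.length; i++) {
            max = Math.max(max, offsets[i + 1] - offsets[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 3, 3, 10, 12}; // element lengths 3, 0, 7, 2
        System.out.println(maxElementLength(offsets)); // 7
    }
}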

Felipe

On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard 
wrote:

> > Could this be a candidate to add to the Statistics?
>
> Agreed ... this would be good info to have.
>
> On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker  wrote:
>
> > Could this be a candidate to add to the Statistics?
> >
> > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti 
> > wrote:
> >
> > > The parquet metadata does not have such information.
> > >
> > >
> > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu 
> > > wrote:
> > >
> > > > Is there any metadata available on the maximum length of a BYTE_ARRAY
> > > > element in a row group?
> > > >
> > > > So, for example, if I have a column of type BYTE_ARRAY with logical
> > > > type UTF8, I want to know the longest possible element in the row
> > > > group.
> > > >
> > > > I am looking for a way to do this that does NOT require going through
> > > > the data itself. So I am asking whether this metadata is stored
> > > > anywhere.
> > > >
> > > > Felipe
> > > >
> > >
> > >
> > >
> > > --
> > > regards,
> > > Deepak Majeti
> > >
> >
>


Re: BYTE_ARRAY possible to get max possible size?

2017-06-07 Thread Michael Howard
> Could this be a candidate to add to the Statistics?

Agreed ... this would be good info to have.

On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker  wrote:

> Could this be a candidate to add to the Statistics?
>
> On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti 
> wrote:
>
> > The parquet metadata does not have such information.
> >
> >
> > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu 
> > wrote:
> >
> > > Is there any metadata available on the maximum length of a BYTE_ARRAY
> > > element in a row group?
> > >
> > > So, for example, if I have a column of type BYTE_ARRAY with logical type
> > > UTF8, I want to know the longest possible element in the row group.
> > >
> > > I am looking for a way to do this that does NOT require going through
> > > the data itself. So I am asking whether this metadata is stored
> > > anywhere.
> > >
> > > Felipe
> > >
> >
> >
> >
> > --
> > regards,
> > Deepak Majeti
> >
>


Re: BYTE_ARRAY possible to get max possible size?

2017-06-07 Thread Lars Volker
Could this be a candidate to add to the Statistics?

On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti 
wrote:

> The parquet metadata does not have such information.
>
>
> On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu 
> wrote:
>
> > Is there any metadata available on the maximum length of a BYTE_ARRAY
> > element in a row group?
> >
> > So, for example, if I have a column of type BYTE_ARRAY with logical type
> > UTF8, I want to know the longest possible element in the row group.
> >
> > I am looking for a way to do this that does NOT require going through
> > the data itself. So I am asking whether this metadata is stored anywhere.
> >
> > Felipe
> >
>
>
>
> --
> regards,
> Deepak Majeti
>


Re: BYTE_ARRAY possible to get max possible size?

2017-06-07 Thread Deepak Majeti
The parquet metadata does not have such information.


On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu 
wrote:

> Is there any metadata available on the maximum length of a BYTE_ARRAY
> element in a row group?
>
> So, for example, if I have a column of type BYTE_ARRAY with logical type
> UTF8, I want to know the longest possible element in the row group.
>
> I am looking for a way to do this that does NOT require going through
> the data itself. So I am asking whether this metadata is stored anywhere.
>
> Felipe
>



-- 
regards,
Deepak Majeti


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Dong Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041333#comment-16041333
 ] 

Dong Jiang commented on PARQUET-392:


Curious about the status of PARQUET-686: so 1.9.0 is out even though the 
blocking issue PARQUET-686 is unresolved? Then it doesn't sound like a 
blocking issue at all.

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041286#comment-16041286
 ] 

Zoltan Ivanfi commented on PARQUET-392:
---

Can this JIRA be closed? It seems to be obsolete as 1.9.0 is already out.

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Writing numpy arrays on disk using pyarrow-parquet

2017-06-07 Thread Wes McKinney
hi Vaishal,

I already replied to you about this on the mailing list on June 1; can
you reply to that thread?

I see that you opened ARROW-1097 about the tensor issue. If you could
add a standalone reproduction of the problem, that would help us debug
and fix it faster.

Thanks
Wes

On Wed, Jun 7, 2017 at 3:09 AM, Vaishal Shah  wrote:
> This is Vaishal from D. E. Shaw and Co.
>
>
>
> We are interested in using pyarrow/Parquet for one of our projects, which
> deals with numpy arrays.
>
> Parquet provides an API to store pandas DataFrames on disk, but I could not
> find any support for storing numpy arrays.
>
>
> Since numpy arrays are such a common way to store data, I was surprised to
> find no function to store them in Parquet format. Is there a way to store a
> numpy array in Parquet format that I may have missed?
>
> Or can we expect this support in a newer version of Parquet?
>
>
> Pyarrow does provide one via Tensors, but read_tensor requires the file to
> be opened in writable mode, which forces the use of memory-mapped files.
> Needing write access just to read a file seems like a bug! Can you please
> look into this?
>
>
>
> --
> *Regards*
>
> *Vaishal Shah,*
> *Third year Undergraduate student,*
> *Department of Computer Science and Engineering,*
> *IIT Kharagpur*


[jira] [Resolved] (PARQUET-839) Min-max should be computed based on logical type

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-839.
---
Resolution: Duplicate

> Min-max should be computed based on logical type
> 
>
> Key: PARQUET-839
> URL: https://issues.apache.org/jira/browse/PARQUET-839
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: format-2.3.1
>Reporter: Tim Armstrong
>
> The min/max stats are currently underspecified: it is not clear in all cases 
> from the spec what the expected ordering is.
> There are some related issues, like PARQUET-686, to fix specific problems, but 
> there seems to be a general assumption that the min/max should be defined 
> based on the primitive type instead of the logical type.
> However, this makes the stats nearly useless for some logical types. E.g. 
> consider a DECIMAL encoded into a (variable-length) BINARY. The min/max of 
> the underlying binary type is based on the lexical order of the byte string, 
> but that does not correspond to any reasonable ordering of the decimal 
> values. E.g. 256 (0x01 0x00) will be ordered between 0 (0x00) and 2 (0x02). 
> This makes min/max filtering a lot less effective and would force query 
> engines using parquet to implement workarounds to produce correct results 
> (e.g. custom comparators).
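
A short sketch of the effect described above, assuming the usual
DECIMAL-in-BINARY convention of a big-endian two's-complement unscaled value
(the class name is illustrative): byte-wise order and numeric order disagree
as soon as the encodings differ in length.

import java.math.BigInteger;
import java.util.Arrays;

public class DecimalByteOrdering {
    public static void main(String[] args) {
        // DECIMAL-in-BINARY stores the unscaled value as big-endian
        // two's-complement bytes, so encodings vary in length.
        byte[] two = BigInteger.valueOf(2).toByteArray();   // {0x02}
        byte[] big = BigInteger.valueOf(256).toByteArray(); // {0x01, 0x00}
        // Unsigned lexicographic order (Java 9+): {0x01, 0x00} < {0x02},
        // although numerically 256 > 2, so a byte-wise min/max over the
        // values {2, 256} would report min = 256.
        System.out.println(Arrays.compareUnsigned(big, two) < 0); // true
    }
}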



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


BYTE_ARRAY possible to get max possible size?

2017-06-07 Thread Felipe Aramburu
Is there any metadata available on the maximum length of a BYTE_ARRAY
element in a row group?

So, for example, if I have a column of type BYTE_ARRAY with logical type
UTF8, I want to know the longest possible element in the row group.

I am looking for a way to do this that does NOT require going through the
data itself. So I am asking whether this metadata is stored anywhere.

Felipe


Parquet sync starting in 10 min

2017-06-07 Thread Julien Le Dem
10am PT on google hangout:
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up

Reminder that this is open to all.
Here is how it goes:
- we do a "round table" of people present where they quickly introduce
themselves and state the topics they wish discussed (if any. Being a "fly
on the wall" is totally fine too)
- based on that first round we summarize the agenda and go over the topics
one by one. (can be just bringing attention of people to a PR that needs a
review or asking if it makes sense to implement some new feature)
 - In the end we send notes back to the list and follow ups happen on JIRA,
github PRs and the dev list.
 - if the time is inconvenient to you say so on the list and we can figure
out something.

-- 
Julien


Writing numpy arrays on disk using pyarrow-parquet

2017-06-07 Thread Vaishal Shah
This is Vaishal from D. E. Shaw and Co.



We are interested in using pyarrow/Parquet for one of our projects, which
deals with numpy arrays.

Parquet provides an API to store pandas DataFrames on disk, but I could not
find any support for storing numpy arrays.


Since numpy arrays are such a common way to store data, I was surprised to
find no function to store them in Parquet format. Is there a way to store a
numpy array in Parquet format that I may have missed?

Or can we expect this support in a newer version of Parquet?


Pyarrow does provide one via Tensors, but read_tensor requires the file to
be opened in writable mode, which forces the use of memory-mapped files.
Needing write access just to read a file seems like a bug! Can you please
look into this?



-- 
*Regards*

*Vaishal Shah,*
*Third year Undergraduate student,*
*Department of Computer Science and Engineering,*
*IIT Kharagpur*


[jira] [Comment Edited] (PARQUET-815) Unable to create parquet file for the given data

2017-06-07 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020637#comment-16020637
 ] 

Navya Krishnappa edited comment on PARQUET-815 at 6/7/17 4:08 PM:
--

Hi [~rdblue] The precision is positive, but the scale is negative. Spark 
supports a negative scale but Parquet doesn't, so we cannot create a Parquet 
file for such data. Please help me out in resolving this issue. 

Thank you


was (Author: navya krishnappa):
Hi [~rdblue] Precision is positive only, but the scale is negative. While spark 
supports negative scale but parquet doesn't. In this case, we can not create 
parquet for such dataset. Please help me out in resolving this issue. 

Thank you
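
One possible workaround, sketched below under made-up names (column "amount",
target type decimal(20,0)), is to cast the negative-scale column to a decimal
type Parquet can represent before writing; this is an illustration, not a fix
proposed in the thread.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

public class NegativeScaleWorkaround {
    // Hypothetical: "amount" was inferred with a negative scale (values such
    // as 9.03E+12), which Parquet's DECIMAL annotation cannot represent.
    static Dataset<Row> normalize(Dataset<Row> dataset) {
        // Casting to a wide scale-0 decimal keeps these integral values
        // exact and yields a type Parquet accepts.
        return dataset.withColumn("amount", col("amount").cast("decimal(20,0)"));
    }
}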

> Unable to create parquet file for the given data
> 
>
> Key: PARQUET-815
> URL: https://issues.apache.org/jira/browse/PARQUET-815
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Navya Krishnappa
>Assignee: Ryan Blue
>
> When I try to read the CSV source file mentioned below and create a Parquet 
> file from it, a java.lang.IllegalArgumentException: Invalid DECIMAL scale: -9 
> is thrown.
> The source file content is 
> Row(column name)
> 9.03E+12
> 1.19E+11
> The following code reads the CSV file and creates the Parquet file:
> // Read the csv file
> Dataset<Row> dataset = getSqlContext().read()
> .option(DAWBConstant.HEADER, "true")
> .option(DAWBConstant.PARSER_LIB, "commons")
> .option(DAWBConstant.INFER_SCHEMA, "true")
> .option(DAWBConstant.DELIMITER, ",")
> .option(DAWBConstant.QUOTE, "\"")
> .option(DAWBConstant.ESCAPE, "
> ")
> .option(DAWBConstant.MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> // create the parquet file
> dataset.write().parquet("//path.parquet");
> Stack trace:
> Caused by: java.lang.IllegalArgumentException: Invalid DECIMAL scale: -9
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
> at 
> org.apache.parquet.schema.Types$PrimitiveBuilder.decimalMetadata(Types.java:410)
> at org.apache.parquet.schema.Types$PrimitiveBuilder.build(Types.java:324)
> at org.apache.parquet.schema.Types$PrimitiveBuilder.build(Types.java:250)
> at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:412)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
> at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
> at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (PARQUET-1023) [C++] Brotli libraries are not being statically linked on Windows

2017-06-07 Thread Max Risuhin (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040889#comment-16040889
 ] 

Max Risuhin commented on PARQUET-1023:
--

[~wesmckinn], sure

> [C++] Brotli libraries are not being statically linked on Windows
> -
>
> Key: PARQUET-1023
> URL: https://issues.apache.org/jira/browse/PARQUET-1023
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.1.0
>Reporter: Wes McKinney
>Assignee: Max Risuhin
>
> When building with toolchain Brotli, the DLLs are required to be in the 
> runtime path. I think it's linking to the wrong .lib files.
> [~Max Risuhin] could you take a look?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (PARQUET-1023) [C++] Brotli libraries are not being statically linked on Windows

2017-06-07 Thread Max Risuhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Risuhin reassigned PARQUET-1023:


Assignee: Max Risuhin

> [C++] Brotli libraries are not being statically linked on Windows
> -
>
> Key: PARQUET-1023
> URL: https://issues.apache.org/jira/browse/PARQUET-1023
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.1.0
>Reporter: Wes McKinney
>Assignee: Max Risuhin
>
> When building with toolchain Brotli, the DLLs are required to be in the 
> runtime path. I think it's linking to the wrong .lib files.
> [~Max Risuhin] could you take a look?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)