[jira] [Resolved] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title

2017-06-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1024.

   Resolution: Fixed
Fix Version/s: 1.9.1

Issue resolved by pull request 415
[https://github.com/apache/parquet-mr/pull/415]

> allow for case insensitive parquet-xxx prefix in PR title
> -
>
> Key: PARQUET-1024
> URL: https://issues.apache.org/jira/browse/PARQUET-1024
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2017-06-09 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1028:
---

 Summary: [JAVA] When reading old Spark-generated files with INT96, 
stats are reported as valid when they aren't 
 Key: PARQUET-1028
 URL: https://issues.apache.org/jira/browse/PARQUET-1028
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.9.0
Reporter: Jacques Nadeau
 Fix For: 1.9.1


Found that the condition 
[here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
 is missing a check for INT96. Since INT96 stats are also corrupt with old 
versions of Parquet, the code here shouldn't short-circuit return.
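The shape of the fix can be sketched in isolation. The following is a simplified, hypothetical illustration, not the actual parquet-mr patch: the names PrimitiveTypeName and shouldIgnoreStatistics mirror parquet-mr's CorruptStatistics, and the version check is a crude stand-in for the real createdBy parsing. The point is only that the INT96 check must come before any version-based short-circuit.

```java
// Hypothetical, simplified sketch of the fix idea: treat INT96 statistics as
// corrupt unconditionally, before the writer-version short-circuit runs.
// Names mirror parquet-mr's CorruptStatistics but this is an illustration.
public class CorruptStatisticsSketch {
    enum PrimitiveTypeName { INT32, INT64, INT96, BYTE_ARRAY }

    // Returns true when statistics must be discarded as untrustworthy.
    static boolean shouldIgnoreStatistics(String createdBy, PrimitiveTypeName type) {
        // Old Spark-generated files wrote INT96 stats using byte-wise
        // comparison; never trust them, regardless of writer version.
        if (type == PrimitiveTypeName.INT96) {
            return true;
        }
        // Crude stand-in for the real createdBy version parsing.
        return createdBy == null || !createdBy.contains("1.8");
    }

    public static void main(String[] args) {
        System.out.println(shouldIgnoreStatistics("parquet-mr version 1.8.1",
                PrimitiveTypeName.INT96));  // prints true
        System.out.println(shouldIgnoreStatistics("parquet-mr version 1.8.1",
                PrimitiveTypeName.INT64));  // prints false
    }
}
```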





Parquet-mr 1.9.1

2017-06-09 Thread Julien Le Dem
There is a need for a 1.9.1 release, in particular for PARQUET-783.
Please chime in on the release ticket (
https://issues.apache.org/jira/browse/PARQUET-1027) if there's something
you need in there.


-- 
Julien


[jira] [Commented] (PARQUET-783) H2SeekableInputStream does not close its underlying FSDataInputStream, leading to connection leaks

2017-06-09 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044715#comment-16044715
 ] 

Julien Le Dem commented on PARQUET-783:
---

Hi [~fuka], I created a JIRA ticket to track a 1.9.1 release: PARQUET-1027.
We should link to it any JIRA we think should be included, and get the release started soon.


> H2SeekableInputStream does not close its underlying FSDataInputStream, 
> leading to connection leaks
> --
>
> Key: PARQUET-783
> URL: https://issues.apache.org/jira/browse/PARQUET-783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 1.10.0
>
>
> {{ParquetFileReader}} opens a {{SeekableInputStream}} to read a footer. In 
> the process, it opens a new {{FSDataInputStream}} and wraps it. However, 
> {{H2SeekableInputStream}} does not override the {{close}} method. Therefore, 
> when {{ParquetFileReader}} closes it, the underlying {{FSDataInputStream}} is 
> not closed. As a result, these stale connections can exhaust a cluster's data 
> nodes' connection resources and lead to mysterious HDFS read failures in HDFS 
> clients, e.g.
> {noformat}
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
> BP-905337612-172.16.70.103-1444328960665:blk_1720536852_646811517
> {noformat}
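The fix described in this issue is small; a hedged, self-contained sketch of the pattern follows, with a plain InputStream standing in for Hadoop's FSDataInputStream and an illustrative class name:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of the fix pattern: a wrapper stream must override close() and
// delegate to the wrapped stream. Without the override, InputStream.close()
// is a no-op for the wrapped stream, and the connection behind it leaks.
// The class name is illustrative; a plain InputStream stands in for
// Hadoop's FSDataInputStream.
class LeakFreeSeekableInputStream extends InputStream {
    private final InputStream wrapped;

    LeakFreeSeekableInputStream(InputStream wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public int read() throws IOException {
        return wrapped.read();
    }

    // The crucial override: close the underlying stream so its connection
    // is actually released when the reader closes this wrapper.
    @Override
    public void close() throws IOException {
        wrapped.close();
    }
}
```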





[jira] [Created] (PARQUET-1027) releas Parquet-mr 1.9.1

2017-06-09 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-1027:
--

 Summary: releas Parquet-mr 1.9.1
 Key: PARQUET-1027
 URL: https://issues.apache.org/jira/browse/PARQUET-1027
 Project: Parquet
  Issue Type: Task
Reporter: Julien Le Dem








[jira] [Updated] (PARQUET-1027) release Parquet-mr 1.9.1

2017-06-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1027:
---
Summary: release Parquet-mr 1.9.1  (was: releas Parquet-mr 1.9.1)

> release Parquet-mr 1.9.1
> 
>
> Key: PARQUET-1027
> URL: https://issues.apache.org/jira/browse/PARQUET-1027
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>






Re: Parquet-cpp build on GCC4.8?

2017-06-09 Thread Wes McKinney
Hi Young,

It looks like your Boost was compiled with a different version of gcc. If
you're targeting gcc 4.8 you need to compile all the dependencies with the
same compiler, otherwise you will have a conflict with the libstdc++ ABI.
Red Hat provides the devtoolset, which helps with deploying on a variety of
different Linux targets.

Let me know if this works!

Thanks
Wes


On Thu, Jun 8, 2017 at 6:59 PM, Young Park 
wrote:

> Hi,
>
> I have been building Parquet-cpp using GCC 5.4 without any issues, but due
> to another library dependency, I will have to use GCC 4.8 to compile
> Parquet-cpp for compatibility reasons.
>
> I saw that the README states that GCC 4.8 or higher is supported, but as I
> keep encountering errors that I did not encounter when using GCC 5.4, I
> was wondering if there are any specific configurations in CMakeLists that I
> would need to change to get it to work.
>
> I have also attached the error log for your reference:
>
> Scanning dependencies of target reader-writer
> [ 90%] Building CXX object examples/CMakeFiles/reader-writer.dir/reader-writer.cc.o
> [ 92%] Linking CXX executable ../build/debug/reader-writer
> ../build/debug/libparquet.a(metadata.cc.o): In function `bool
> boost::regex_match<...>(...)':
> /usr/include/boost/regex/v4/regex_match.hpp:50: undefined reference to
> `boost::re_detail::perl_matcher<...>::match()'
> ../build/debug/libparquet.a(metadata.cc.o): In function
> `boost::re_detail::perl_matcher<...>::perl_matcher(...)':
> /usr/include/boost/regex/v4/perl_matcher.hpp:365: undefined reference to
> `boost::re_detail::perl_matcher<...>::construct_init(...)'
> collect2: error: ld returned 1 exit status
> examples/CMakeFiles/reader-writer.dir/build.make:103: recipe for target
> 'build/debug/reader-writer' failed
> make[2]: *** [build/debug/reader-writer] Error 1
> CMakeFiles/Makefile2:628: recipe for target
> 'examples/CMakeFiles/reader-writer.dir/all' failed
> make[1]: *** [examples/CMakeFiles/reader-writer.dir/all] Error 2
> Makefile:138: recipe for target 'all' failed
>
> Thanks in advance for your help!
> Young
>


Re: BYTE_ARRAY possible to get max possible size?

2017-06-09 Thread Deepak Majeti
I am not sure if there is API to directly modify serialized thrift data.

On Fri, Jun 9, 2017 at 5:45 AM, Felipe Aramburu 
wrote:

> If you were to store it as a separate file, as opposed to making it inside
> the file, then all of a sudden you have to manage keeping those two
> files available. Your individual parquet files no longer contain all the
> information you may want or need. Is it not possible to modify the
> contents of a parquet file so long as you are not changing the size of the
> data that was written? So for example, if I have a parquet file and I want
> to modify the bytes in that file without changing the size of the file, I
> am pretty sure this is possible, right?
>
> On Thu, Jun 8, 2017 at 11:09 PM, Deepak Majeti 
> wrote:
>
> > User metadata can be specified in parquet via key-value metadata. But once
> > a parquet file has been written, any modification will require a re-write:
> > basically, de-serialize, modify, and serialize. Bytes that are part of the
> > parquet-format (spec) will require the above process.
> > If your proposal is to keep a scratchpad buffer, say 64KB, at the end of a
> > file that is not part of the parquet-format, I don't see a lot of benefit
> > to it. Why not store the custom extensions as a separate file?
> >
> > And, this is indeed the right place to bring up more ideas.
> >
> >
> > On Thu, Jun 8, 2017 at 1:47 PM, Felipe Aramburu 
> > wrote:
> >
> > > It might be interesting at some point to consider specifying some extra
> > > bytes available in the metadata that can be used to read  potential
> > > extensions.
> > >
> > > Does that sound silly? Is this the right place to bring up ideas like
> > this?
> > >
> > > On Thu, Jun 8, 2017 at 11:34 AM, Lars Volker  wrote:
> > >
> > > > Yes, I can't think of a way to add this information to the file
> without
> > > at
> > > > least partially rewriting it. I don't know of a tool to update file
> > > > metadata without doing a complete rewrite.
> > > >
> > > > On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu <
> fel...@blazingdb.com>
> > > > wrote:
> > > >
> > > > > The answer to this is probably no. But I  imagine that it is not
> > > > considered
> > > > > acceptable to try and modify this statistics information AFTER the
> > > > parquet
> > > > > file has been generated correct?
> > > > >
> > > > > On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker 
> wrote:
> > > > >
> > > > > > I suppose you would look at the Statistics struct in the
> > > parquet.thrift
> > > > > >  > > > > > src/main/thrift/parquet.thrift>
> > > > > > file
> > > > > > in the parquet-format project. Before spending much time on this,
> > you
> > > > may
> > > > > > want to seek more feedback, possibly on this list, and by
> opening a
> > > > JIRA.
> > > > > > Since it likely is a rather small change, you might also go ahead
> > and
> > > > > > create a pull request and ask for feedback there. Please note,
> that
> > > the
> > > > > PR
> > > > > > will need a corresponding JIRA in its title.
> > > > > >
> > > > > > You can find more detailed information on the individual steps
> > here:
> > > > > > https://parquet.apache.org/contribute/
> > > > > >
> > > > > > Cheers, Lars
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <
> > > fel...@blazingdb.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > So I guess its just calculating the distance between the
> offsets.
> > > For
> > > > > now
> > > > > > > we might just make that part of our "catalogue" step. If I
> wanted
> > > to
> > > > > add
> > > > > > it
> > > > > > > to statistics is there somewhere you can point me to where that
> > > would
> > > > > be
> > > > > > > added?
> > > > > > >
> > > > > > > Felipe
> > > > > > >
> > > > > > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <
> > > > mhow...@podiumdata.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > > >
> > > > > > > > Agreed ... this would be good info to have.
> > > > > > > >
> > > > > > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker  >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > > > >
> > > > > > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <
> > > > > > > majeti.dee...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > The parquet metadata does not have such information.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <
> > > > > > > fel...@blazingdb.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Is there any metadata available on the maximum length
> of
> > an
> > > > > > element
> > > > > > > > > > > BYTE_ARRAY in 

Re: BYTE_ARRAY possible to get max possible size?

2017-06-09 Thread Felipe Aramburu
If you were to store it as a separate file, as opposed to making it inside
the file, then all of a sudden you have to manage keeping those two
files available. Your individual parquet files no longer contain all the
information you may want or need. Is it not possible to modify the
contents of a parquet file so long as you are not changing the size of the
data that was written? So for example, if I have a parquet file and I want
to modify the bytes in that file without changing the size of the file, I
am pretty sure this is possible, right?
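For what it's worth, overwriting bytes without changing a file's length is straightforward at the filesystem level; whether the result remains a *valid* Parquet file is the harder question, since the footer's offsets and statistics must stay consistent with whatever is overwritten. A minimal sketch of the mechanical part, using a hypothetical helper rather than any Parquet API:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper illustrating in-place modification: overwrite bytes at
// an offset without changing the file's length. This says nothing about
// whether the patched file is still valid Parquet -- the footer's statistics
// and offsets must remain consistent with the overwritten region.
public class InPlacePatch {
    // Overwrites patch.length bytes starting at offset; file length is unchanged
    // as long as offset + patch.length does not extend past the end of the file.
    static void patchInPlace(File file, long offset, byte[] patch) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek(offset);
            raf.write(patch);
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("patch", ".bin");
        java.nio.file.Files.write(f.toPath(), "hello world".getBytes());
        patchInPlace(f, 6, "earth".getBytes());
        // File content is now "hello earth"; length is still 11 bytes.
        System.out.println(new String(java.nio.file.Files.readAllBytes(f.toPath())));
        f.delete();
    }
}
```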

On Thu, Jun 8, 2017 at 11:09 PM, Deepak Majeti 
wrote:

> User metadata can be specified in parquet via key-value metadata. But once
> a parquet file has been written, any modification will require a re-write:
> basically, de-serialize, modify, and serialize. Bytes that are part of the
> parquet-format (spec) will require the above process.
> If your proposal is to keep a scratchpad buffer, say 64KB, at the end of a
> file that is not part of the parquet-format, I don't see a lot of benefit
> to it. Why not store the custom extensions as a separate file?
>
> And, this is indeed the right place to bring up more ideas.
>
>
> On Thu, Jun 8, 2017 at 1:47 PM, Felipe Aramburu 
> wrote:
>
> > It might be interesting at some point to consider specifying some extra
> > bytes available in the metadata that can be used to read  potential
> > extensions.
> >
> > Does that sound silly? Is this the right place to bring up ideas like
> this?
> >
> > On Thu, Jun 8, 2017 at 11:34 AM, Lars Volker  wrote:
> >
> > > Yes, I can't think of a way to add this information to the file without
> > at
> > > least partially rewriting it. I don't know of a tool to update file
> > > metadata without doing a complete rewrite.
> > >
> > > On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu 
> > > wrote:
> > >
> > > > The answer to this is probably no. But I  imagine that it is not
> > > considered
> > > > acceptable to try and modify this statistics information AFTER the
> > > parquet
> > > > file has been generated correct?
> > > >
> > > > On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker  wrote:
> > > >
> > > > > I suppose you would look at the Statistics struct in the
> > parquet.thrift
> > > > >  > > > > src/main/thrift/parquet.thrift>
> > > > > file
> > > > > in the parquet-format project. Before spending much time on this,
> you
> > > may
> > > > > want to seek more feedback, possibly on this list, and by opening a
> > > JIRA.
> > > > > Since it likely is a rather small change, you might also go ahead
> and
> > > > > create a pull request and ask for feedback there. Please note, that
> > the
> > > > PR
> > > > > will need a corresponding JIRA in its title.
> > > > >
> > > > > You can find more detailed information on the individual steps
> here:
> > > > > https://parquet.apache.org/contribute/
> > > > >
> > > > > Cheers, Lars
> > > > >
> > > > > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <
> > fel...@blazingdb.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > So I guess its just calculating the distance between the offsets.
> > For
> > > > now
> > > > > > we might just make that part of our "catalogue" step. If I wanted
> > to
> > > > add
> > > > > it
> > > > > > to statistics is there somewhere you can point me to where that
> > would
> > > > be
> > > > > > added?
> > > > > >
> > > > > > Felipe
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <
> > > mhow...@podiumdata.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > >
> > > > > > > Agreed ... this would be good info to have.
> > > > > > >
> > > > > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker 
> > > wrote:
> > > > > > >
> > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > > >
> > > > > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <
> > > > > > majeti.dee...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > The parquet metadata does not have such information.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <
> > > > > > fel...@blazingdb.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Is there any metadata available on the maximum length of
> an
> > > > > element
> > > > > > > > > > BYTE_ARRAY in a row group.
> > > > > > > > > >
> > > > > > > > > > So for example if I have a column which is of type
> > BYTE_ARRAY
> > > > > > Logical
> > > > > > > > > type
> > > > > > > > > > UTF8 and I want to know what the longest possible element
> > in
> > > > the
> > > > > > row
> > > > > > > > > group
> > > > > > > > > > is.
> > > > > > > > > >
> > > > > > > > > > I am looking for a method to do this which does NOT
> require