[jira] [Updated] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-04 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated PARQUET-1265:
---
Description: 
I'm seeing a segfault when I link and run against a shared libparquet.so that has 
boost statically linked into it. Given the backtrace, this appears to be caused by 
the static ApplicationVersion constants, most likely a static initialization order 
issue. The problem goes away if I turn those static variables into static functions 
returning function-local statics.

Backtrace:
{code}
#0  0x7753cf8b in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
    from /lib64/libstdc++.so.6
#1  0x77aeae9c in boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() ()
    from debug/libparquet.so.1
#2  0x77adcc2b in boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, boost::re_detail_106600::cpp_regex_traits_implementation<char> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, unsigned long) ()
    from debug/libparquet.so.1
#3  0x77ae9023 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, unsigned int) ()
    from debug/libparquet.so.1
#4  0x77a5ed98 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580,
    p1=0x77af66d8 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
    p2=0x77af6720 "", f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
#5  0x77a5b653 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580,
    p=0x77af66d8 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
    f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
#6  0x77a57049 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff5580,
    p=0x77af66d8 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
    f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
#7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion (this=0x77ddbfc0 <parquet::ApplicationVersion::PARQUET_251_FIXED_VERSION>,
    created_by="parquet-mr version 1.8.0") at /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
#8  0x77a516c5 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535)
    at /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
#9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) ()
    at /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
#10 0x77dec1e3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0001 in ?? ()
#13 0x7fff5ff5 in ?? ()
#14 0x in ?? ()
{code}

Versions:
- gcc-4.8.5
- boost-1.66.0
- parquet-cpp-1.4.0

  was:
I'm seeing a segfault when I link/run with a shared libparquet.so with 
statically linked boost. Given the backtrace, it seems that this is due to the 
static ApplicationVersion constants for the fixes, probably some static 
initialization order issue. The problem goes away if I turn those static vars 
into static funcs returning function-local statics.

Backtrace:
{code}
#0  0x7753cf8b in std::basic_string::basic_string(std::string const&) () from 
/lib64/libstdc++.so.6
#1  0x77aeae9c in 
boost::re_detail_106600::cpp_regex_traits_char_layer::init() () from 
debug/libparquet.so.1
#2  0x77adcc2b in 
boost::object_cache::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, unsigned 
long) () from debug/libparquet.so.1
#3  0x77ae9023 in boost::basic_regex >::do_assign(char const*, char const*, unsigned 
int) () from debug/libparquet.so.1
#4  0x77a5ed98 in boost::basic_regex >::assign (this=0x7fff5580, 
p1=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 p2=0x77af6720 "", f=0) at 
/tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
#5  0x77a5b653 in boost::basic_regex >::assign (this=0x7fff5580, 
p=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
#6  0x77a57049 in boost::basic_regex >::basic_regex (this=0x7fff5580, 

[jira] [Commented] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-04 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426250#comment-16426250
 ] 

Lawrence Chan commented on PARQUET-1265:


Just to be explicit about what I mean, this seems to work:

{code}
const ApplicationVersion& ApplicationVersion::PARQUET_251_FIXED_VERSION() {
  static ApplicationVersion version("parquet-mr version 1.8.0");
  return version;
}
{code}
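
For anyone who wants to try the idea outside of parquet-cpp, here is a minimal, 
self-contained sketch of the construct-on-first-use idiom used above. The 
simplified ApplicationVersion class and its created_by() accessor below are 
illustrative stand-ins, not the real parquet-cpp API:

{code}
#include <iostream>
#include <string>

// Simplified stand-in for parquet::ApplicationVersion, only to show the idiom.
class ApplicationVersion {
 public:
  explicit ApplicationVersion(const std::string& created_by)
      : created_by_(created_by) {}
  const std::string& created_by() const { return created_by_; }

  // Construct-on-first-use: the object is built the first time this function
  // is called (thread-safe since C++11) instead of during the shared library's
  // dynamic initialization, which sidesteps the unspecified cross-TU static
  // initialization order seen in the backtrace.
  static const ApplicationVersion& PARQUET_251_FIXED_VERSION() {
    static ApplicationVersion version("parquet-mr version 1.8.0");
    return version;
  }

 private:
  std::string created_by_;
};

int main() {
  // Call sites gain parentheses compared to the static data member.
  std::cout << ApplicationVersion::PARQUET_251_FIXED_VERSION().created_by()
            << std::endl;
  return 0;
}
{code}

If parquet-cpp adopted this, the existing uses of the constants in metadata.cc 
would presumably just need the extra parentheses.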

> Segfault on static ApplicationVersion initialization
> 
>
> Key: PARQUET-1265
> URL: https://issues.apache.org/jira/browse/PARQUET-1265
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Lawrence Chan
>Priority: Major
>
> I'm seeing a segfault when I link/run with a shared libparquet.so with 
> statically linked boost. Given the backtrace, it seems that this is due to 
> the static ApplicationVersion constants for the fixes, probably some static 
> initialization order issue. The problem goes away if I turn those static vars 
> into static funcs returning function-local statics.
> Backtrace:
> {code}
> #0  0x7753cf8b in std::basic_string std::allocator >::basic_string(std::string const&) () from 
> /lib64/libstdc++.so.6
> #1  0x77aeae9c in 
> boost::re_detail_106600::cpp_regex_traits_char_layer::init() () from 
> debug/libparquet.so.1
> #2  0x77adcc2b in 
> boost::object_cache boost::re_detail_106600::cpp_regex_traits_implementation 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, 
> unsigned long) () from debug/libparquet.so.1
> #3  0x77ae9023 in boost::basic_regex boost::cpp_regex_traits > >::do_assign(char const*, char const*, 
> unsigned int) () from debug/libparquet.so.1
> #4  0x77a5ed98 in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fff5580, 
> p1=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x77af6720 "", f=0) at 
> /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x77a5b653 in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x77a57049 in boost::basic_regex boost::cpp_regex_traits > >::basic_regex (this=0x7fff5580, 
> p=0x77af66d8 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x77ddbfc0 
> , 
> created_by="parquet-mr version 1.8.0") at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
> #8  0x77a516c5 in __static_initialization_and_destruction_0 
> (__initialize_p=1, __priority=65535) at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
> #9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) () at 
> /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
> #10 0x77dec1e3 in _dl_init_internal () from 
> /lib64/ld-linux-x86-64.so.2
> #11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
> #12 0x0001 in ?? ()
> #13 0x7fff5ff5 in ?? ()
> #14 0x in ?? ()
> {code}
> Versions:
> - gcc-4.8.5
> - boost-1.66.0
> - parquet-cpp-1.4.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-04 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated PARQUET-1265:
---
Description: 
I'm seeing a segfault when I link/run with a shared libparquet.so with 
statically linked boost. Given the backtrace, it seems that this is due to the 
static ApplicationVersion constants for the fixes, probably some static 
initialization order issue. The problem goes away if I turn those static vars 
into static funcs returning function-local statics.

Backtrace:
{code}
#0  0x7753cf8b in std::basic_string::basic_string(std::string const&) () from 
/lib64/libstdc++.so.6
#1  0x77aeae9c in 
boost::re_detail_106600::cpp_regex_traits_char_layer::init() () from 
debug/libparquet.so.1
#2  0x77adcc2b in 
boost::object_cache::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, unsigned 
long) () from debug/libparquet.so.1
#3  0x77ae9023 in boost::basic_regex >::do_assign(char const*, char const*, unsigned 
int) () from debug/libparquet.so.1
#4  0x77a5ed98 in boost::basic_regex >::assign (this=0x7fff5580, 
p1=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 p2=0x77af6720 "", f=0) at 
/tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:381
#5  0x77a5b653 in boost::basic_regex >::assign (this=0x7fff5580, 
p=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:366
#6  0x77a57049 in boost::basic_regex >::basic_regex (this=0x7fff5580, 
p=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 f=0) at /tmp/boost-1.66.0/include/boost/regex/v4/basic_regex.hpp:335
#7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion 
(this=0x77ddbfc0 , 
created_by="parquet-mr version 1.8.0") at 
/tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
#8  0x77a516c5 in __static_initialization_and_destruction_0 
(__initialize_p=1, __priority=65535) at 
/tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
#9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) () at 
/tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
#10 0x77dec1e3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0001 in ?? ()
#13 0x7fff5ff5 in ?? ()
#14 0x in ?? ()
{code}

Versions:
- gcc-4.8.5
- boost-1.66.0
- parquet-cpp-1.4.0

  was:
I'm seeing a segfault when I link/run with a shared libparquet.so with 
statically linked boost. Given the backtrace, it seems that this is due to the 
static ApplicationVersion constants for the fixes, probably some static 
initialization order issue. The problem goes away if I turn those static vars 
into static funcs returning function-local statics.

Backtrace:
{code}
#0  0x7753cf8b in std::basic_string::basic_string(std::string const&) () from 
/lib64/libstdc++.so.6
#1  0x77aeae9c in 
boost::re_detail_106600::cpp_regex_traits_char_layer::init() () from 
debug/libparquet.so.1
#2  0x77adcc2b in 
boost::object_cache::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, unsigned 
long) () from debug/libparquet.so.1
#3  0x77ae9023 in boost::basic_regex >::do_assign(char const*, char const*, unsigned 
int) () from debug/libparquet.so.1
#4  0x77a5ed98 in boost::basic_regex >::assign (this=0x7fff5580, 
p1=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 p2=0x77af6720 "", f=0) at 
/home/modules/rhel7/boost-1.66.0-em/include/boost/regex/v4/basic_regex.hpp:381
#5  0x77a5b653 in boost::basic_regex >::assign (this=0x7fff5580, 
p=0x77af66d8 
"(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
 f=0) at 
/home/modules/rhel7/boost-1.66.0-em/include/boost/regex/v4/basic_regex.hpp:366
#6  0x77a57049 in boost::basic_regex

[jira] [Created] (PARQUET-1265) Segfault on static ApplicationVersion initialization

2018-04-04 Thread Lawrence Chan (JIRA)
Lawrence Chan created PARQUET-1265:
--

 Summary: Segfault on static ApplicationVersion initialization
 Key: PARQUET-1265
 URL: https://issues.apache.org/jira/browse/PARQUET-1265
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Affects Versions: cpp-1.4.0
Reporter: Lawrence Chan


I'm seeing a segfault when I link and run against a shared libparquet.so that has 
boost statically linked into it. Given the backtrace, this appears to be caused by 
the static ApplicationVersion "fixed version" constants, probably a static 
initialization order issue. The problem goes away if I turn those static variables 
into static functions returning function-local statics.

Backtrace:
{code}
#0  0x7753cf8b in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
    from /lib64/libstdc++.so.6
#1  0x77aeae9c in boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() ()
    from debug/libparquet.so.1
#2  0x77adcc2b in boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, boost::re_detail_106600::cpp_regex_traits_implementation<char> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, unsigned long) ()
    from debug/libparquet.so.1
#3  0x77ae9023 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, unsigned int) ()
    from debug/libparquet.so.1
#4  0x77a5ed98 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580,
    p1=0x77af66d8 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
    p2=0x77af6720 "", f=0) at /home/modules/rhel7/boost-1.66.0-em/include/boost/regex/v4/basic_regex.hpp:381
#5  0x77a5b653 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::assign (this=0x7fff5580,
    p=0x77af66d8 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
    f=0) at /home/modules/rhel7/boost-1.66.0-em/include/boost/regex/v4/basic_regex.hpp:366
#6  0x77a57049 in boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff5580,
    p=0x77af66d8 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
    f=0) at /home/modules/rhel7/boost-1.66.0-em/include/boost/regex/v4/basic_regex.hpp:335
#7  0x77a4fa1f in parquet::ApplicationVersion::ApplicationVersion (this=0x77ddbfc0 <parquet::ApplicationVersion::PARQUET_251_FIXED_VERSION>,
    created_by="parquet-mr version 1.8.0") at /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:477
#8  0x77a516c5 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535)
    at /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:58
#9  0x77a5179e in _GLOBAL__sub_I_metadata.cc(void) ()
    at /tmp/parquet-cpp-apache-parquet-cpp-1.4.0/src/parquet/metadata.cc:913
#10 0x77dec1e3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#11 0x77dde21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0001 in ?? ()
#13 0x7fff5ff5 in ?? ()
#14 0x in ?? ()
{code}

Versions:
- gcc-4.8.5
- boost-1.66.0
- parquet-cpp-1.4.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-04 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425994#comment-16425994
 ] 

Ryan Blue commented on PARQUET-1253:


Not including the UUID logical type in that union is probably an accident.

MAP_KEY_VALUE is no longer used. It is noted in [backward compatibility 
rules|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1],
 but is not required for any types.

The [comment "only valid for 
primitives"|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.5.0/src/main/thrift/parquet.thrift#L384]
 is incorrect. I think we can remove it. I'm not sure why the comment was there.

> Support for new logical type representation
> ---
>
> Key: PARQUET-1253
> URL: https://issues.apache.org/jira/browse/PARQUET-1253
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Latest parquet-format 
> [introduced|https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252]
>  a new representation for logical types. As of now this is not yet supported 
> in parquet-mr, thus there's no way to use parametrized UTC normalized 
> timestamp data types. When reading and writing Parquet files, besides 
> 'converted_type' parquet-mr should use the new 'logicalType' field in 
> SchemaElement to tell the current logical type annotation. To maintain 
> backward compatibility, the semantic of converted_type shouldn't change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1261) Parquet-format interns strings when reading filemetadata

2018-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425919#comment-16425919
 ] 

ASF GitHub Bot commented on PARQUET-1261:
-

robert3005 commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378686932
 
 
   I've dug a bit more into the JVM source code and it's slightly more 
complicated, and not exactly as Scott is saying. String#intern does indeed end 
up in the StringTable in the JVM, and there's no distinction between explicitly 
interned strings and what the compiler/JVM interns. The problem, though, is 
that handling of that space is GC-specific. The article Scott links is accurate 
for default JVM settings, while the links I posted were for the CMS garbage 
collector. From my reading of the code, interning is really only an issue under 
CMS (since it is very reluctant to reclaim space from the StringTable), while 
ParallelGC and G1 will consider it on every GC cycle. Additionally, interning 
or not, you can get the same benefit by using `UseStringDeduplication` under G1 
(the default collector from Java 9 onwards).
   
   I am doing some benchmarking, but it seems that switching away from string 
interning has potential benefits for those using the CMS GC and shouldn't make 
a significant difference on newer JVMs. Will update the PR once I am done 
benchmarking.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Parquet-format interns strings when reading filemetadata
> 
>
> Key: PARQUET-1261
> URL: https://issues.apache.org/jira/browse/PARQUET-1261
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
>Priority: Major
>
> Parquet-format when deserializing metadata will intern strings. References I 
> could find suggested that it had been done to reduce memory pressure early 
> on. Java (and jvm in particular) went a long way since then and interning is 
> generally discouraged, see 
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good 
> explanation. What is more since java 8 there's string deduplication 
> implemented at GC level per [http://openjdk.java.net/jeps/192.] During our 
> usage and testing we found the interning to cause significant gc pressure for 
> long running applications due to bigger GC root set.
> This issue proposes removing interning given it's questionable whether it 
> should be used in modern jvms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1261) Parquet-format interns strings when reading filemetadata

2018-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425885#comment-16425885
 ] 

ASF GitHub Bot commented on PARQUET-1261:
-

julienledem commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378676662
 
 
   Thanks a lot for the details, Scott.
   I'd like to add that the number of distinct strings here is not that big, 
since these are the names of the fields in schemas (individual field names, not 
fully qualified paths). They are referred to many times, though, especially 
when inspecting footers from many files. If that's a problem we can switch to a 
different deduping mechanism; it seems the overhead of a separate map would 
still be reasonable.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Parquet-format interns strings when reading filemetadata
> 
>
> Key: PARQUET-1261
> URL: https://issues.apache.org/jira/browse/PARQUET-1261
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
>Priority: Major
>
> Parquet-format when deserializing metadata will intern strings. References I 
> could find suggested that it had been done to reduce memory pressure early 
> on. Java (and jvm in particular) went a long way since then and interning is 
> generally discouraged, see 
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good 
> explanation. What is more since java 8 there's string deduplication 
> implemented at GC level per [http://openjdk.java.net/jeps/192.] During our 
> usage and testing we found the interning to cause significant gc pressure for 
> long running applications due to bigger GC root set.
> This issue proposes removing interning given it's questionable whether it 
> should be used in modern jvms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1259) Parquet-protobuf support both protobuf 2 and protobuf 3

2018-04-04 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1259.

Resolution: Workaround

Supporting more than one version adds complexity.

It sounds like people can use protobuf 2 syntax with the protobuf 3 library, so 
I would recommend that instead.

I'll close this for now. Please re-open if this is not satisfactory.

> Parquet-protobuf support both protobuf 2 and protobuf 3
> ---
>
> Key: PARQUET-1259
> URL: https://issues.apache.org/jira/browse/PARQUET-1259
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Qinghui Xu
>Priority: Major
>
> With the merge of pull request 
> [https://github.com/apache/parquet-mr/pull/407,] parquet-protobuf now uses 
> protobuf 3, and this implies that it cannot work in an environment where 
> people are using protobuf 2 in their own dependencies, because there are new 
> APIs / breaking changes in protobuf 3. People will face a dependency version 
> conflict with the next parquet-protobuf release (e.g. 1.9.1 or 1.10.0).
> What if we support both protobuf 2 and protobuf 3 by providing 
> parquet-protobuf and parquet-protobuf2?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-04 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425789#comment-16425789
 ] 

Nandor Kollar commented on PARQUET-1253:


While working on the new logical type representation, three questions came to 
mind:
* Although there is a Thrift struct for the UUID logical type in parquet-format, 
it is not included in the LogicalType union. Is this on purpose, or was it 
omitted accidentally? How should parquet-mr handle schemas where the UUID 
annotation is used but there's no corresponding LogicalType mapping?
* A similar question for MAP_KEY_VALUE, which is not represented at all in the 
new representation. What should parquet-mr do with schemas that use it in the 
old representation?
* In parquet-format, the comment for {{optional LogicalType logicalType}} says 
{{"The logical type of this SchemaElement; only valid for primitives."}}, but 
I'm confused, because there are Map and List logical types, which - as far as I 
know - make sense only on groups. What was the intention of this comment? Am I 
missing anything?

[~rdblue] I can see that you worked on the new logical type representation; 
could you please help me clarify these questions?

> Support for new logical type representation
> ---
>
> Key: PARQUET-1253
> URL: https://issues.apache.org/jira/browse/PARQUET-1253
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Latest parquet-format 
> [introduced|https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252]
>  a new representation for logical types. As of now this is not yet supported 
> in parquet-mr, thus there's no way to use parametrized UTC normalized 
> timestamp data types. When reading and writing Parquet files, besides 
> 'converted_type' parquet-mr should use the new 'logicalType' field in 
> SchemaElement to tell the current logical type annotation. To maintain 
> backward compatibility, the semantic of converted_type shouldn't change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool

2018-04-04 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1256:
-
Fix Version/s: cpp-1.5.0

> [C++] Add --print-key-value-metadata option to parquet_reader tool
> --
>
> Key: PARQUET-1256
> URL: https://issues.apache.org/jira/browse/PARQUET-1256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jacek Pliszka
>Priority: Trivial
>  Labels: patch
> Fix For: cpp-1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Added --print-key-value-metadata option to parquet_reader tool
> https://github.com/apache/parquet-cpp/pull/450
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1256) Please review and merge: added --print-key-value-metadata option to parquet_reader tool

2018-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425661#comment-16425661
 ] 

ASF GitHub Bot commented on PARQUET-1256:
-

xhochy commented on issue #450: PARQUET-1256: Add --print-key-value-metadata 
option to parquet_reader tool
URL: https://github.com/apache/parquet-cpp/pull/450#issuecomment-378632891
 
 
   @JacekPliszka The build fails in Travis due to a conversion problem:
   
   ```
   /home/travis/build/apache/parquet-cpp/src/parquet/printer.cc: In member function
   ‘void parquet::ParquetFilePrinter::DebugPrint(std::ostream&, std::list<int>, bool, bool, const char*)’:
   /home/travis/build/apache/parquet-cpp/src/parquet/printer.cc:47:64: error:
   conversion to ‘int’ from ‘int64_t {aka long int}’ may alter its value [-Werror=conversion]
     int size_of_key_value_metadata = key_value_metadata->size();
   ```
   
   Can you adjust the int type to the correct one? Then this should be ready to 
merge.
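   
   For reference, a likely one-line fix (just a sketch inferred from the error 
message above; the surrounding code in printer.cc may differ):
   
   ```
   // Keep the count in an int64_t so -Werror=conversion does not fire:
   int64_t size_of_key_value_metadata = key_value_metadata->size();
   // ...or, if an int is genuinely needed downstream, cast explicitly:
   // int size_of_key_value_metadata = static_cast<int>(key_value_metadata->size());
   ```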


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Please review and merge: added --print-key-value-metadata option to 
> parquet_reader tool
> ---
>
> Key: PARQUET-1256
> URL: https://issues.apache.org/jira/browse/PARQUET-1256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jacek Pliszka
>Priority: Trivial
>  Labels: patch
> Fix For: cpp-1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Added --print-key-value-metadata option to parquet_reader tool
> https://github.com/apache/parquet-cpp/pull/450
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Solution to read/write multiple parquet files

2018-04-04 Thread Uwe L. Korn
> Then what is the best practice to cutting these rows into
> parquet files ?
This depends a bit on what you are going to do with them afterwards.
Typically, RowGroups should be sized such that you can load them in bulk
into memory if you do batch processing on them. If you only plan to
query them, then depending on the query engine, having smaller or larger
RowGroups will make a performance difference. In general, it is best to
check what happens with these files and then profile.
> Another question is that should we keep same RowGroup size for one
> parquet file ?
You can vary the RowGroup size inside a Parquet file if that gives you
better performance. It is probably best to keep them evenly sized, though,
so that the size of the materialized data in memory is the same for all
RowGroups.
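
To make that concrete, here is a rough C++ sketch of how the RowGroup size is
chosen when writing through parquet-cpp's Arrow integration (written from
memory, so treat the exact function names and signatures as assumptions rather
than a definitive recipe):

  #include <cstdint>
  #include <memory>
  #include <string>

  #include "arrow/io/file.h"
  #include "arrow/table.h"
  #include "parquet/arrow/writer.h"

  // chunk_size is the number of rows per RowGroup; pick it so that one
  // RowGroup's materialized data fits comfortably in memory for your readers.
  ::arrow::Status WriteWithRowGroups(const std::shared_ptr<::arrow::Table>& table,
                                     const std::string& path,
                                     int64_t rows_per_row_group) {
    std::shared_ptr<::arrow::io::FileOutputStream> sink;
    ARROW_RETURN_NOT_OK(::arrow::io::FileOutputStream::Open(path, &sink));
    return parquet::arrow::WriteTable(*table, ::arrow::default_memory_pool(),
                                      sink, /*chunk_size=*/rows_per_row_group);
  }
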
Uwe

On Tue, Apr 3, 2018, at 11:37 AM, Lizhou Gao wrote:
> Thanks for your quick reply!
> Given the scenario below: there are 200k rows of SQL data, where rows 0-100k 
> contain more nulls while rows 100k-200k contain more non-null values. If we 
> convert the two parts into Parquet files, we may get 0-100k.parquet (500M) 
> and 100k-200k.parquet (1.3G). Then what is the best practice for cutting 
> these rows into Parquet files?
> Another question: should we keep the same RowGroup size within one Parquet 
> file?
> 
> 
> Thanks,
> Lizhou
> -- Original --
> *From: * "Uwe L. Korn";
> *Date: * Tue, Apr 3, 2018 04:21 PM
> *To: * "dev"; 
> 
> *Subject: * Re: Solution to read/write multiple parquet files
>  
> Hello Lizhou,
> 
> on the Python side there is http://dask.pydata.org/en/latest/ that can
> read large, distributed Parquet datasets. When using `engine=pyarrow`,
> it also uses parquet-cpp under the hood.
> 
> On the pure C++ side, I know that https://github.com/thrill/thrill has
> experimental parquet support. But this is an experimental feature in
> an experimental framework, so be careful about relying on it.
> 
> In general, Parquet files should not exceed the single-digit gigabyte
> range, and the RowGroups inside these files should also be 128MiB or
> less. You will be able to write tools that can deal with other sizes,
> but that will break the portability aspect of Parquet files a bit.
> 
> Uwe
> 
> On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> > Hi experts,
> > We have a storage engine that needs to manage a large set of data (PB
> > level). Currently we store it as a single Parquet file. After some
> > searching, it seems the data should be cut into multiple Parquet files
> > for further reading/writing/managing. But I don't know whether there is
> > already an open-source solution to read/write/manage multiple Parquet
> > files. Our programming language is C++.
> > Any comments/suggestions are welcomed. Thanks!
> > 
> > 
> > Regards,
> > Lizhou



[jira] [Commented] (PARQUET-1261) Parquet-format interns strings when reading filemetadata

2018-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425274#comment-16425274
 ] 

ASF GitHub Bot commented on PARQUET-1261:
-

scottcarey commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378547346
 
 
   A way to dedupe strings that does not use String.intern() is to use a weak 
reference set.
   
   Basically, use a WeakHashMap, potentially wrapped via 
Collections.newSetFromMap and Collections.synchronizedSet. If you need 
concurrency without synchronization, you can't easily use a ConcurrentHashMap, 
though you can write a class that extends WeakReference for keys and overrides 
the equals and hashCode methods... this is slow because you have to wrap every 
access in a new WeakReference (which WeakHashMap avoids).
   
   If you have Guava on the classpath, 
https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html
 is thread-safe and based on a concurrent map. I would avoid having Guava on 
the classpath here because of version conflicts with user code, though perhaps 
shading a copy of the classes you need is fine.
   
   Sadly, you can't create something like this on your own with just Guava's 
MapMaker, because they use a package-protected method to allow using key 
equivalence instead of reference equality on weak keys.
   
   The simplest thing is using the JDK's WeakHashMap and dealing with its 
thread-safety issues.
   
   
   Now for my last point: the original claim that OOMs were caused by 
String.intern() is bogus on JRE 7+.
   
   Read this: http://java-performance.info/string-intern-in-java-6-7-8/
   
   Interned strings that are no longer referenced are GC'd, and the claim that 
this is not done 'frequently' is false. The code referenced on the mailing list 
relates to when the JVM's class-unloading code removes interned strings that 
were referenced statically by the class, and has _nothing_ to do with user-mode 
calls to `String.intern()`.
   
   If OOM conditions are happening, they are not caused by calls to intern(). 
Switching to a WeakHashMap will only increase the heap required to manage it. 
Increasing the StringTable size might help somewhat on the performance side 
(the benchmark at shipilev.net clearly shows how performance tanks when the 
distinct string count exceeds the table size).
   
   So in summary:
   
   1. The original concern seems incorrect, unless it is running on JRE 6 with 
too small a perm gen configured. Otherwise, this should not lead to OOM, since 
the Strings are strongly referenced anyway. However, String.intern can lead to 
some performance issues with under-sized string tables.
   2. Doing the interning by hand with Guava's Interner or a WeakHashMap might 
be a win performance-wise, but will likely use a bit more heap.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Parquet-format interns strings when reading filemetadata
> 
>
> Key: PARQUET-1261
> URL: https://issues.apache.org/jira/browse/PARQUET-1261
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
>Priority: Major
>
> Parquet-format when deserializing metadata will intern strings. References I 
> could find suggested that it had been done to reduce memory pressure early 
> on. Java (and jvm in particular) went a long way since then and interning is 
> generally discouraged, see 
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good 
> explanation. What is more since java 8 there's string deduplication 
> implemented at GC level per [http://openjdk.java.net/jeps/192.] During our 
> usage and testing we found the interning to cause significant gc pressure for 
> long running applications due to bigger GC root set.
> This issue proposes removing interning given it's questionable whether it 
> should be used in modern jvms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)