[jira] [Commented] (PARQUET-1455) [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf

2020-08-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172690#comment-17172690
 ] 

ASF GitHub Bot commented on PARQUET-1455:
-----------------------------------------

browles commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-670219375


   Hey guys, what is the status on this PR? We are currently experiencing this 
issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf
> --------------------------------------------------------------------
>
> Key: PARQUET-1455
> URL: https://issues.apache.org/jira/browse/PARQUET-1455
> Project: Parquet
>  Issue Type: Bug
>Reporter: Qinghui Xu
>Assignee: Qinghui Xu
>Priority: Major
>  Labels: pull-request-available
>
> Background - 
> In protobuf, an enum is more like an integer than a string, and it is encoded
> as an integer on the wire.
> Each enum value is associated with a number (integer), and people can set an
> enum field using the number directly, regardless of whether that number is
> associated with a defined enum value or not. When an enum field is set with a
> number that does not match any enum value defined in the schema, reading the
> field through the protobuf reflection API (as parquet-protobuf does) yields a
> label "UNKNOWN_ENUM__" generated by protobuf reflection. 
> Thus parquet-protobuf will write the string "UNKNOWN_ENUM__" 
> into the enum column whenever its protobuf schema does not recognize the 
> number.
>  
> Problems -
> There are two cases of unknown enums when using parquet-protobuf:
>  1. The protobuf message already contains an unknown enum value when we write
> it to parquet (sometimes people manipulate enums using numbers), so
> parquet-protobuf writes a label "UNKNOWN_ENUM_*" as a string in parquet. When
> we read it back into protobuf, we find this "true" unknown value.
>  2. The protobuf message contains a valid value when written to parquet, but
> the reader uses an outdated proto schema that is missing some enum values, so
> the not-in-old-schema enum values are "unknown" to the reader.
> The current behavior of the parquet-proto reader is to reject both cases with
> a runtime exception. This does not make sense in case 1: the write path
> respects protobuf enum behavior while the read path does not. And case 2
> should be handled if the protobuf user is interested in the number instead of
> the label.
>  
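
For readers unfamiliar with the reflection behavior described above, here is a minimal sketch of how the synthetic label arises. It assumes a hypothetical generated proto3 class "Event" with an enum field "status"; neither name comes from the ticket.

{code:java}
// Minimal sketch, not from the ticket: assumes a hypothetical generated proto3
// class "Event" with an enum field "status".
import com.google.protobuf.Descriptors;

public class UnknownEnumSketch {
  public static void main(String[] args) {
    // Set the enum field by raw number; assume 42 is not defined in the schema.
    Event event = Event.newBuilder().setStatusValue(42).build();

    // parquet-protobuf reads fields through the reflection API, so it gets a
    // synthetic EnumValueDescriptor instead of the raw number.
    Descriptors.FieldDescriptor statusField =
        Event.getDescriptor().findFieldByName("status");
    Descriptors.EnumValueDescriptor status =
        (Descriptors.EnumValueDescriptor) event.getField(statusField);

    // Prints the synthetic "UNKNOWN_ENUM_..." label, which is the string that
    // ends up in the Parquet enum column.
    System.out.println(status.getName());
  }
}
{code}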



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] browles commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

2020-08-06 Thread GitBox


browles commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-670219375


   Hey guys, what is the status on this PR? We are currently experiencing this 
issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (PARQUET-1895) Update jackson-databind to 2.9.10.5

2020-08-06 Thread Patrick OFriel (Jira)
Patrick OFriel created PARQUET-1895:
------------------------------------

 Summary: Update jackson-databind to 2.9.10.5
 Key: PARQUET-1895
 URL: https://issues.apache.org/jira/browse/PARQUET-1895
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Patrick OFriel
 Fix For: 1.12.0


jackson-databind 2.9.10.4 has the following CVEs:

[https://nvd.nist.gov/vuln/detail/CVE-2020-14060]
[https://nvd.nist.gov/vuln/detail/CVE-2020-14061]
[https://nvd.nist.gov/vuln/detail/CVE-2020-14062]
[https://nvd.nist.gov/vuln/detail/CVE-2020-14195]

They should be resolved by updating to 2.9.10.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] shangxinli removed a comment on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


shangxinli removed a comment on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670162874


   Thanks Gidon! No problem. We have deployed completely, so no hurry for
   reviewing. As long as it is merged to upstream, we have a path to go in the
   future.
   
   For the Configuration setting approach, that was our initial thinking, but
   we soon found out it requires the Parquet applications (we have all 3 query
   engines, Spark/Hive/Presto, plus in-house Parquet tools) to be changed to set
   the Configuration. So we changed to using an extended ParquetWriteSupport in
   plugin mode. That is how we got here today.
   
   On Thu, Aug 6, 2020 at 12:48 PM ggershinsky 
   wrote:
   
   > Thanks @shangxinli  for the detailed
   > response, much appreciated! I want to allocate ample time to get to the
   > bottom of this subject, and with the weekend starting here, it might be a
   > problem in the next few days; but I'll get back on this early next week.
   >
   
   
   -- 
   Xinli Shang
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670162874


   Thanks Gidon! No problem. We have deployed completely, so no hurry for
   reviewing. As long as it is merged to upstream, we have a path to go in the
   future.
   
   For the Configuration setting approach, that was our initial thinking, but
   we soon found out it requires the Parquet applications (we have all 3 query
   engines, Spark/Hive/Presto, plus in-house Parquet tools) to be changed to set
   the Configuration. So we changed to using an extended ParquetWriteSupport in
   plugin mode. That is how we got here today.
   
   On Thu, Aug 6, 2020 at 12:48 PM ggershinsky 
   wrote:
   
   > Thanks @shangxinli  for the detailed
   > response, much appreciated! I want to allocate ample time to get to the
   > bottom of this subject, and with the weekend starting here, it might be a
   > problem in the next few days; but I'll get back on this early next week.
   >
   
   
   -- 
   Xinli Shang
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] ggershinsky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


ggershinsky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670157959


   Thanks @shangxinli for the detailed response, much appreciated! I want to 
allocate ample time to get to the bottom of this subject, and with the weekend 
starting here, it might be a problem in the next few days; but I'll get back on 
this early next week.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670094258


   Hi @ggershinsky, Thanks for your comments! 
   
   I think the current difference is how to transport the crypto settings from 
the source to the destination (the EncryptionPropertiesFactory implementation). What you 
suggest is via 'Configuration', while our approach is hybrid ('Configuration' for 
global settings, 'Schema' for db/table/column settings). The concern with using 
'Configuration' for all the settings is that it is global, so we would need a 
well-designed namespace and protocol (string format) to locate the 
db/table/column (nested), along with other table-wide settings, as the map key, 
which requires a lot of string operations and is error-prone. The hybrid 
solution is more compact because the metadata is attached to the table/column as a 
structured field. Consider a case where I want to encrypt the footer for table1 but not for 
table2, and use AES-GCM for table3 but AES-CTR for table4. With 
'Configuration' it is doable, but consider how complex the string format 
acting as the protocol for this setting would have to be. And sometimes a job needs to deal with 
different metastores, which would require a prefix before 'db'. 
   
   Regarding changes to existing code, I understand that if we set the Configuration, 
there is no need to change the code after that. But somewhere the settings need to be 
put into the Configuration, and that will be a code change in some cases. Consider 
somebody (who might not be an engineer) setting a table/column to be encrypted in 
HMS or another centralized metastore. That setting then needs to be converted into 
Configuration settings, so a code change is unavoidable, correct? We prefer 
the changes to happen in a plugin owned by a dedicated team instead of in the 
many feature teams' codebases. That is why we extend 
ParquetWriteSupport and build it as a plugin. 
   
   There is an option where, in the extended ParquetWriteSupport, we put the settings 
into the Configuration instead of the Parquet schema. As mentioned above, we think schema 
settings in this case would be more compact and less error-prone than 
Configuration settings. 
   
   In my opinion, your solution is a better fit when users create 
a table and want to set some columns to be encrypted; they only need to set it 
in the Configuration. But it is difficult to use if you already have a lot of 
tables that are not encrypted today and, due to some initiative, they need to be 
encrypted. Somebody (who might not be the table owner; it could be the security team or the 
legal team, etc.) changes the settings in the metastore for a table/column, and the 
framework just starts encrypting that table/column. That is why we 
call it schema controlled: when the schema defined in the metastore is 
changed, it controls the encryption downstream.  
 
   I slightly disagree with your statement that it is not schema activated. I 
think it depends on the definition of schema. In our definition, the settings 
for a table or a column, including name, type, nullability, and crypto settings, are all 
considered part of the schema. Of course, you can define it another way, but 
I think the name or definition is not important. We should focus on whether or 
not it makes sense for users to use.   
   
   Essentially this change mainly introduces a metadata map field in the Parquet 
schema, while all other changes are in tests (or the plugin). As mentioned earlier, 
if we look at the schema definitions in other systems like Avro and Spark, they all 
have associated metadata. I think adding that metadata field to the Parquet schema 
to align with other systems' schemas is the right thing to do.  
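
   To make the trade-off above concrete, here is a rough sketch of the two transport options; the configuration key names and the metadata keys below are invented for illustration and are not existing parquet-mr settings.

```java
// Hypothetical sketch of the two transport options discussed above; the key
// names are made up for illustration and are not parquet-mr settings.
import org.apache.hadoop.conf.Configuration;
import java.util.HashMap;
import java.util.Map;

public class CryptoSettingsTransport {
  public static void main(String[] args) {
    // Option 1: everything in the global Configuration. Per-column settings
    // need a string "namespace" encoding metastore/db/table/column.
    Configuration conf = new Configuration(false);
    conf.set("example.crypto.footer.encrypt.hms1.db1.table1", "true");
    conf.set("example.crypto.algorithm.hms1.db1.table3", "AES_GCM_V1");
    conf.set("example.crypto.column.key.hms1.db1.table1.ssn", "key1");

    // Option 2: the hybrid approach, i.e. a metadata map carried next to the
    // column in the schema, so no string protocol is needed to locate the column.
    Map<String, String> columnMetadata = new HashMap<>();
    columnMetadata.put("encrypt", "true");
    columnMetadata.put("key.id", "key1");

    System.out.println(conf.get("example.crypto.column.key.hms1.db1.table1.ssn"));
    System.out.println(columnMetadata);
  }
}
```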
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (PARQUET-1856) [C++] Test suite assumes that Snappy support is built

2020-08-06 Thread Keith Hughitt (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172392#comment-17172392
 ] 

Keith Hughitt edited comment on PARQUET-1856 at 8/6/20, 2:08 PM:
-----------------------------------------------------------------

Hmm. Possibly unrelated, but I encounter similar errors with 
`-DARROW_WITH_SNAPPY=ON`, even though Snappy was properly detected, e.g.:

{code:java}
-- Building using CMake version: 3.18.1
-- Arrow version: 1.0.0 (full: '1.0.0')
-- Arrow SO version: 100 (full: 100.0.0)
..
-- ARROW_SNAPPY_BUILD_VERSION: 1.1.8{code}

From the test logs ({{ag -B2 -A2 snapp 
src/build/Testing/Temporary/LastTest.log --nonumbers}}):

{code:java}
[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/2.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)4> (0 ms)
[--] 1 test from TestStatisticsSortOrder/2 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/3.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/3.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)5> (0 ms)
[--] 1 test from TestStatisticsSortOrder/3 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/4.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/4.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)6> (0 ms)
[--] 1 test from TestStatisticsSortOrder/4 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/5.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/5.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)7> (0 ms)
[--] 1 test from TestStatisticsSortOrder/5 (0 ms total)
–
[ RUN ] TestStatisticsSortOrderFLBA.UnknownSortOrder
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrderFLBA.UnknownSortOrder (0 ms)
[--] 1 test from TestStatisticsSortOrderFLBA (0 ms total)
–
[ RUN ] TestDumpWithLocalFile.DumpOutput
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestDumpWithLocalFile.DumpOutput (1 ms)
[--] 1 test from TestDumpWithLocalFile (1 ms total)
–
expected_code
Which is: 1-byte object <00>
Expected reading file to return OK, but got IOError: NotImplemented: Snappy 
codec support not built
[ FAILED ] TestArrowReaderAdHoc.HandleDictPageOffsetZero (1 ms)
[--] 3 tests from TestArrowReaderAdHoc (2 ms total)
{code}
 

*System Info:*
 * Arch Linux 5.7.12 (64-bit)
 * CMake 3.18.1
 * Snappy 1.1.8
 * Arrow 1.0.0


was (Author: khughitt):
Hmm. Possibly unrelated, but I encounter similar errors when 
`DARROW_WITH_SNAPPY=ON`, even when Snappy was properly detected, e.g.:
{code:java}
– Building using CMake version: 3.18.1
 – Arrow version: 1.0.0 (full: '1.0.0')
 – Arrow SO version: 100 (full: 100.0.0)
 ..
 – ARROW_SNAPPY_BUILD_VERSION: 1.1.8{code}
From the test logs ({{ag -B2 -A2 snapp 
src/build/Testing/Temporary/LastTest.log --nonumbers}}):
{code:java}
[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown 

[jira] [Comment Edited] (PARQUET-1856) [C++] Test suite assumes that Snappy support is built

2020-08-06 Thread Keith Hughitt (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172392#comment-17172392
 ] 

Keith Hughitt edited comment on PARQUET-1856 at 8/6/20, 2:07 PM:
-----------------------------------------------------------------

Hmm. Possibly unrelated, but I encounter similar errors with 
`-DARROW_WITH_SNAPPY=ON`, even though Snappy was properly detected, e.g.:
{code:java}
-- Building using CMake version: 3.18.1
-- Arrow version: 1.0.0 (full: '1.0.0')
-- Arrow SO version: 100 (full: 100.0.0)
..
-- ARROW_SNAPPY_BUILD_VERSION: 1.1.8{code}
From the test logs ({{ag -B2 -A2 snapp 
src/build/Testing/Temporary/LastTest.log --nonumbers}}):
{code:java}
[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/2.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)4> (0 ms)
[--] 1 test from TestStatisticsSortOrder/2 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/3.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/3.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)5> (0 ms)
[--] 1 test from TestStatisticsSortOrder/3 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/4.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/4.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)6> (0 ms)
[--] 1 test from TestStatisticsSortOrder/4 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/5.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/5.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)7> (0 ms)
[--] 1 test from TestStatisticsSortOrder/5 (0 ms total)
–
[ RUN ] TestStatisticsSortOrderFLBA.UnknownSortOrder
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrderFLBA.UnknownSortOrder (0 ms)
[--] 1 test from TestStatisticsSortOrderFLBA (0 ms total)
–
[ RUN ] TestDumpWithLocalFile.DumpOutput
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestDumpWithLocalFile.DumpOutput (1 ms)
[--] 1 test from TestDumpWithLocalFile (1 ms total)
–
expected_code
Which is: 1-byte object <00>
Expected reading file to return OK, but got IOError: NotImplemented: Snappy 
codec support not built
[ FAILED ] TestArrowReaderAdHoc.HandleDictPageOffsetZero (1 ms)
[--] 3 tests from TestArrowReaderAdHoc (2 ms total)
{code}
 

*System Info:*
 * Arch Linux 5.7.12 (64-bit)
 * CMake 3.18.1
 * Snappy 1.1.8
 * Arrow 1.0.0


was (Author: khughitt):
Hmm. Possibly unrelated, but I encounter similar errors when 
`DARROW_WITH_SNAPPY=ON`, even when Snappy was properly detected, e.g.:

 

```

– Building using CMake version: 3.18.1
– Arrow version: 1.0.0 (full: '1.0.0')
– Arrow SO version: 100 (full: 100.0.0)

..
– ARROW_SNAPPY_BUILD_VERSION: 1.1.8

```

 

From the test logs (`ag -B2 -A2 snapp src/build/Testing/Temporary/LastTest.log 
--nonumbers`):

 

```

[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown file: 

[jira] [Comment Edited] (PARQUET-1856) [C++] Test suite assumes that Snappy support is built

2020-08-06 Thread Keith Hughitt (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172392#comment-17172392
 ] 

Keith Hughitt edited comment on PARQUET-1856 at 8/6/20, 2:03 PM:
-----------------------------------------------------------------

Hmm. Possibly unrelated, but I encounter similar errors with 
`-DARROW_WITH_SNAPPY=ON`, even though Snappy was properly detected, e.g.:

 

```

-- Building using CMake version: 3.18.1
-- Arrow version: 1.0.0 (full: '1.0.0')
-- Arrow SO version: 100 (full: 100.0.0)

..
-- ARROW_SNAPPY_BUILD_VERSION: 1.1.8

```

 

From the test logs (`ag -B2 -A2 snapp src/build/Testing/Temporary/LastTest.log 
--nonumbers`):

 

```

[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/2.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)4> (0 ms)
[--] 1 test from TestStatisticsSortOrder/2 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/3.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/3.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)5> (0 ms)
[--] 1 test from TestStatisticsSortOrder/3 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/4.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/4.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)6> (0 ms)
[--] 1 test from TestStatisticsSortOrder/4 (0 ms total)
–
[ RUN ] TestStatisticsSortOrder/5.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/5.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)7> (0 ms)
[--] 1 test from TestStatisticsSortOrder/5 (0 ms total)
–
[ RUN ] TestStatisticsSortOrderFLBA.UnknownSortOrder
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrderFLBA.UnknownSortOrder (0 ms)
[--] 1 test from TestStatisticsSortOrderFLBA (0 ms total)
–
[ RUN ] TestDumpWithLocalFile.DumpOutput
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestDumpWithLocalFile.DumpOutput (1 ms)
[--] 1 test from TestDumpWithLocalFile (1 ms total)
–
expected_code
Which is: 1-byte object <00>
Expected reading file to return OK, but got IOError: NotImplemented: Snappy 
codec support not built
[ FAILED ] TestArrowReaderAdHoc.HandleDictPageOffsetZero (1 ms)
[--] 3 tests from TestArrowReaderAdHoc (2 ms total)

```

 


was (Author: khughitt):
Hmm. Possibly unrelated, but I encounter similar errors when 
`DARROW_WITH_SNAPPY=ON`, even when Snappy was properly detected, e.g.:

 

```

-- Building using CMake version: 3.18.1
-- Arrow version: 1.0.0 (full: '1.0.0')
-- Arrow SO version: 100 (full: 100.0.0)

..
-- ARROW_SNAPPY_BUILD_VERSION: 1.1.8

```

 

From the test logs (`ag -B2 -A2 snapp src/build/Testing/Temporary/LastTest.log 
--nonumbers`):

 

```

[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the 

[jira] [Commented] (PARQUET-1856) [C++] Test suite assumes that Snappy support is built

2020-08-06 Thread Keith Hughitt (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172392#comment-17172392
 ] 

Keith Hughitt commented on PARQUET-1856:


Hmm. Possibly unrelated, but I encounter similar errors with 
`-DARROW_WITH_SNAPPY=ON`, even though Snappy was properly detected, e.g.:

 

```

-- Building using CMake version: 3.18.1
-- Arrow version: 1.0.0 (full: '1.0.0')
-- Arrow SO version: 100 (full: 100.0.0)

..
-- ARROW_SNAPPY_BUILD_VERSION: 1.1.8

```

 

From the test logs (`ag -B2 -A2 snapp src/build/Testing/Temporary/LastTest.log 
--nonumbers`):

 

```

[ RUN ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/0.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1> (0 ms)
[--] 1 test from TestStatisticsSortOrder/0 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/1.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/1.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)2> (0 ms)
[--] 1 test from TestStatisticsSortOrder/1 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/2.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/2.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)4> (0 ms)
[--] 1 test from TestStatisticsSortOrder/2 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/3.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/3.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)5> (0 ms)
[--] 1 test from TestStatisticsSortOrder/3 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/4.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/4.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)6> (0 ms)
[--] 1 test from TestStatisticsSortOrder/4 (0 ms total)
--
[ RUN ] TestStatisticsSortOrder/5.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrder/5.MinMax, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)7> (0 ms)
[--] 1 test from TestStatisticsSortOrder/5 (0 ms total)
--
[ RUN ] TestStatisticsSortOrderFLBA.UnknownSortOrder
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestStatisticsSortOrderFLBA.UnknownSortOrder (0 ms)
[--] 1 test from TestStatisticsSortOrderFLBA (0 ms total)
--
[ RUN ] TestDumpWithLocalFile.DumpOutput
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
[ FAILED ] TestDumpWithLocalFile.DumpOutput (1 ms)
[--] 1 test from TestDumpWithLocalFile (1 ms total)
--
 expected_code
 Which is: 1-byte object <00>
Expected reading file to return OK, but got IOError: NotImplemented: Snappy 
codec support not built
[ FAILED ] TestArrowReaderAdHoc.HandleDictPageOffsetZero (1 ms)
[--] 3 tests from TestArrowReaderAdHoc (2 ms total)

```

 

> [C++] Test suite assumes that Snappy support is built
> -----------------------------------------------------
>
> Key: PARQUET-1856
> URL: https://issues.apache.org/jira/browse/PARQUET-1856
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The test suite fails if {{-DARROW_WITH_SNAPPY=OFF}}
> {code}
> [--] 1 test from TestStatisticsSortOrder/0, where TypeParam = 
> parquet::PhysicalType<(parquet::Type::type)1>
> [ RUN  ] TestStatisticsSortOrder/0.MinMax
> unknown file: Failure
> C++ exception with description "NotImplemented: Snappy codec support not 
> built" thrown in the test body.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1559) Add way to manually commit already written data to disk

2020-08-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172169#comment-17172169
 ] 

Gabor Szadovszky commented on PARQUET-1559:
-------------------------------------------

The file writing logic is mainly implemented in ParquetFileWriter. It does not 
invoke a flush on the output stream, meaning it is up to the underlying 
OutputStream implementation (in case of Hadoop, an FSDataOutputStream) 
to decide when to write to disk. But this is independent of the memory 
footprint of Parquet: after writing to the output stream, the data of the row 
group becomes available for GC. Please note that the related statistics 
(column/offset indexes, other footer values, maybe bloom filters from the next 
release) are still kept in memory because they are written when the file gets 
closed.
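
For anyone who wants to observe this behavior, a small parquet-avro write loop along the lines below can be used; the schema, output path, and row-group size are arbitrary examples, and whether bytes reach the disk before close() still depends on the underlying output stream.

{code:java}
// Sketch only: illustrates that rows handed to the writer are not durable on
// disk until close(); the schema, path, and row-group size are arbitrary.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class FlushDemo {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
        + "[{\"name\":\"id\",\"type\":\"long\"}]}");

    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/flush-demo.parquet"))
        .withSchema(schema)
        .withRowGroupSize(8 * 1024 * 1024)  // small row groups, as in the question
        .build()) {
      for (long i = 0; i < 1_000_000; i++) {
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", i);
        writer.write(rec);
        // When a row group fills up, it is handed to the output stream and its
        // memory becomes eligible for GC, but the footer (indexes, statistics)
        // is only written on close(), so the file is not readable until then.
      }
    }
  }
}
{code}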

> Add way to manually commit already written data to disk
> --------------------------------------------------------
>
> Key: PARQUET-1559
> URL: https://issues.apache.org/jira/browse/PARQUET-1559
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Victor
>Priority: Major
>
> I'm not exactly sure this is compliant with the way parquet works, but I have 
> the following need:
>  * I'm using parquet-avro to write to a parquet file during a long running 
> process
>  * I would like to be able from time to time to access the already written 
> data
> So I was expecting to be able to flush manually the file to ensure the data 
> is on disk and then copy the file for preliminary analysis.
> If it's contradictory to the way parquet works (for example there is 
> something about metadata being at the footer of the file), what would then be 
> the alternative?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr maybe? It would then write 
> multiple files in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1559) Add way to manually commit already written data to disk

2020-08-06 Thread wxmimperio (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172137#comment-17172137
 ] 

wxmimperio edited comment on PARQUET-1559 at 8/6/20, 8:41 AM:
--------------------------------------------------------------

[~gszadovszky]

Thank you for your answer.

I want to know: if I set up relatively small row groups, flush the column 
store to the page store frequently, and flush to the outputStream, will the 
data be flushed to disk? (I know the data is unreadable at this time, but the 
column store and page store memory can be released by GC.)
 pageStore.flushToFileWriter(parquetFileWriter);
 This method just flushes the page store to the outputStream, so the data should 
still be in memory at this time until outputStream.close() runs.

When I reduced rowGroupSize to 8 MB, I found the debug log 
{color:#c1c7d0}LOG.debug("Flushing mem columnStore to file. allocated memory: 
{}", columnStore.getAllocatedSize()){color}, but the file on HDFS has no 
content or size. I guess the outputStream did not flush the data out.


was (Author: wxmimperio):
[~gszadovszky]

Thank you for your answer.

I want to know, if I set up relatively small row groups, refresh the colnum 
store to the page store frequently, and refresh to the outputStream, will the 
data be flushed to disk?(I know the data is unreadable at this time, but the 
column store and page store memory can be released by gc)
pageStore.flushToFileWriter(parquetFileWriter);
This method just refreshes page stroe to outputStream, so the data should still 
be in memory at this time until outputStream.close() the data flush to disk.

When I have reduced rowGroupSize = 8Mb and I find debug log: 
LOG.debug("Flushing mem columnStore to file. allocated memory: {}", 
columnStore.getAllocatedSize()), but the file on hdfs has no content and size. 
I guess it is outPutStream did not flush the data out.

> Add way to manually commit already written data to disk
> --------------------------------------------------------
>
> Key: PARQUET-1559
> URL: https://issues.apache.org/jira/browse/PARQUET-1559
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Victor
>Priority: Major
>
> I'm not exactly sure this is compliant with the way parquet works, but I have 
> the following need:
>  * I'm using parquet-avro to write to a parquet file during a long running 
> process
>  * I would like to be able from time to time to access the already written 
> data
> So I was expecting to be able to flush manually the file to ensure the data 
> is on disk and then copy the file for preliminary analysis.
> If it's contradictory to the way parquet works (for example there is 
> something about metadata being at the footer of the file), what would then be 
> the alternative?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr maybe? It would then write 
> multiple files in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1559) Add way to manually commit already written data to disk

2020-08-06 Thread wxmimperio (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172137#comment-17172137
 ] 

wxmimperio commented on PARQUET-1559:
-------------------------------------

[~gszadovszky]

Thank you for your answer.

I want to know: if I set up relatively small row groups, flush the column 
store to the page store frequently, and flush to the outputStream, will the 
data be flushed to disk? (I know the data is unreadable at this time, but the 
column store and page store memory can be released by GC.)
pageStore.flushToFileWriter(parquetFileWriter);
This method just flushes the page store to the outputStream, so the data should still 
be in memory at this time until outputStream.close() flushes the data to disk.

When I reduced rowGroupSize to 8 MB, I found the debug log 
LOG.debug("Flushing mem columnStore to file. allocated memory: {}", 
columnStore.getAllocatedSize()), but the file on HDFS has no content or size. 
I guess the outputStream did not flush the data out.

> Add way to manually commit already written data to disk
> --------------------------------------------------------
>
> Key: PARQUET-1559
> URL: https://issues.apache.org/jira/browse/PARQUET-1559
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Victor
>Priority: Major
>
> I'm not exactly sure this is compliant with the way parquet works, but I have 
> the following need:
>  * I'm using parquet-avro to write to a parquet file during a long running 
> process
>  * I would like to be able from time to time to access the already written 
> data
> So I was expecting to be able to flush manually the file to ensure the data 
> is on disk and then copy the file for preliminary analysis.
> If it's contradictory to the way parquet works (for example there is 
> something about metadata being at the footer of the file), what would then be 
> the alternative?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr maybe? It would then write 
> multiple files in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] ggershinsky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-06 Thread GitBox


ggershinsky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-669725393


   > extend ParquetWriteSupport which converts crypto setting in schema..
   
   @shangxinli your previous pull request (already merged) added a very nice 
Encryption Factory interface that allows for transparent activation of 
encryption via existing parameters (such as the schema and Hadoop configuration) - 
without making any modifications in the frameworks, including their Parquet 
glue code (write support). The current pull request breaks this model, since it 
introduces new configuration channels and requires changes in the frameworks 
on top. These changes can also introduce confusion that schema config changes 
are the way to activate Parquet encryption - while in fact the frameworks can 
activate Parquet encryption via e.g. the existing Hadoop config (with any proper 
implementation of EncryptionPropertiesFactory, such as 
PropertiesDrivenCryptoFactory). For example, Spark just needs to update its 
Parquet version; no other changes are needed. Lastly, this PR might break 
future interoperability with C++-based frameworks (such as pyarrow and pandas) 
- we're adding high-level encryption support there, based on the standard 
EncryptionPropertiesFactory model.
   
   As mentioned above, this pull request is not schema-activated encryption. 
The encryption here is activated by a collection of custom parameter maps 
piggy-backed on the schema elements. Since such custom maps don't exist today, 
this PR adds them.
   
   Fortunately, there is a simple solution that allows you to support your 
organization's requirements without adding such maps and without breaking the 
Crypto Factory model. Currently, you have your crypto parameters passed in 
custom maps. But the Hadoop config is also a map. 
You already use it for file-wide parameters (such as the encryption algorithm). You 
can also use it for column parameters - by adding them with column names, or 
simply by concatenating them into a single string parameter. See HIVE-21848 for 
an example (or PropertiesDrivenCryptoFactory, which implements HIVE-21848).
   
   > The way it works is SchemaCryptoPropertiesFactory should have RPC call 
with KMS ...
   > We don't want to release helper function like setKey() etc because that 
will require code changes in the existing pipelines. 
   
   You might go a step further and utilize the PropertiesDrivenCryptoFactory 
directly in your deployment. This factory is designed for efficient KMS 
support, keeping RPC calls to the absolute minimum. It sets keys and 
metadata without requiring code changes in the existing pipelines. This factory 
also implements cryptography best practices, e.g. preventing mistakes like 
using the same data key for many files. Moreover, if you use it, you will tap 
into community support and future improvements in this area. 
   But this is only a recommendation; crypto factories are pluggable and can 
be swapped with any custom implementation, as long as it follows the defined 
model.
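
   As an illustration of the "single concatenated string parameter" idea, the sketch below sets per-column keys through the Hadoop config. The property names follow the PropertiesDrivenCryptoFactory / HIVE-21848 style but should be treated as assumptions here; check the parquet-mr release you use for the exact keys.

```java
// Rough sketch of passing crypto settings via the Hadoop config. The property
// names and key IDs are illustrative, not guaranteed to match a given release.
import org.apache.hadoop.conf.Configuration;

public class HadoopConfigCryptoExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);

    // Plug in the crypto factory implementation (class name illustrative).
    conf.set("parquet.crypto.factory.class",
        "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

    // File-wide settings.
    conf.set("parquet.encryption.footer.key", "footerKeyId");

    // Per-column settings concatenated into one string:
    // "<masterKeyId>:<col>,<col>;<masterKeyId>:<col>"
    conf.set("parquet.encryption.column.keys",
        "piiKeyId:ssn,email;financeKeyId:salary");

    System.out.println(conf.get("parquet.encryption.column.keys"));
  }
}
```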



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org