Jesse Lord created ARROW-4874:
---------------------------------
Summary: Cannot read parquet from encrypted hdfs
Key: ARROW-4874
URL: https://issues.apache.org/jira/browse/ARROW-4874
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Environment: cloudera yarn cluster, red hat enterprise 7
Reporter: Jesse Lord
Using pyarrow 0.12 I was initially able to read parquet files. Then the admins
added KMS servers and encrypted all of the files on the cluster. Now I get an
error, and the file system object can only read objects from the local file
system of the edge node.
Reproducible example:
{{import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('/user/jlord/test_lots_of_parquet/', 'rb') as fil:
    _ = fil.read()}}
error:
{{19/03/14 10:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/}}
If I specify a specific parquet file in that folder I get the following error:
{{19/03/14 10:07:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet}}
Not sure if this is relevant: Spark can continue to read the parquet files, but
it takes a Cloudera-specific build that can read the following KMS key provider
entry from core-site.xml and hdfs-site.xml:
{{<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@server1.com;server2.com:16000/kms</value>
</property>}}
Using the open source version of Spark requires changing these XML values to:
{{<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@server1.com:16000/kms</value>
  <value>kms://http@server2.com:16000/kms</value>
</property>}}
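For illustration only (this helper is hypothetical, not part of Arrow or Hadoop), the rewrite from the Cloudera multi-host URI to the per-host form can be sketched as:

```python
def split_kms_uri(uri):
    """Split a Cloudera-style multi-host KMS URI such as
    kms://http@host1;host2:16000/kms into one standard
    kms://http@host:port/path URI per host."""
    scheme, rest = uri.split("@", 1)    # 'kms://http' | 'host1;host2:16000/kms'
    hosts, tail = rest.split(":", 1)    # 'host1;host2' | '16000/kms'
    return ["%s@%s:%s" % (scheme, h, tail) for h in hosts.split(";")]

print(split_kms_uri("kms://http@server1.com;server2.com:16000/kms"))
# ['kms://http@server1.com:16000/kms', 'kms://http@server2.com:16000/kms']
```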
Arrow might need to be pointed at separate configuration XMLs.
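One avenue worth checking (a sketch only; the paths below are assumptions for a typical Cloudera install, not verified): libhdfs, which pyarrow uses under the hood, picks up the Hadoop client configuration from the environment, so pointing it at the conf directory whose core-site.xml/hdfs-site.xml carry the KMS settings before calling {{pa.hdfs.connect()}} may change which filesystem it resolves:

```shell
# Assumed locations for a Cloudera cluster -- adjust to your install.
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/opt/cloudera/parcels/CDH
export HADOOP_CONF_DIR=/etc/hadoop/conf       # dir with core-site.xml / hdfs-site.xml
export CLASSPATH=$(hadoop classpath --glob)   # libhdfs needs the Hadoop jars and conf dir

python local_hdfs.py
```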
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)