svjack opened a new issue #9026:
URL: https://github.com/apache/arrow/issues/9026


   I used pyarrow to handle HDFS files stored by Hive, and I have reviewed the pyarrow source code.
   The main utilities around the HDFS filesystem are the Parquet functions, which cover a lot of IO as well as metadata and schema inference, and are rich to use.
   The other aspect is the plain read functions: opening a file and reading it as text in order to manipulate text files in the HDFS filesystem.
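   For context, this is roughly how I access a raw file today (a minimal sketch; the namenode host/port and the warehouse path are placeholders for my cluster's values):

```python
import pyarrow.fs

# placeholder namenode host/port; real values come from the cluster config
fs = pyarrow.fs.HadoopFileSystem("namenode-host", port=8020)

# open the raw file backing a Hive table and read it as plain text
with fs.open_input_stream("/user/hive/warehouse/my_db.db/my_table/000000_0") as f:
    text = f.read().decode("utf-8")
```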
   As far as I know, when I create a table in Hive the default storage format is text, and when I use the HDFS filesystem to dig down to the actual path of the Hive table in HDFS, it seems the table's schema and metadata (and the automatic parsing of delimited lines) cannot be retrieved through pyarrow's internal API.
   I don't want to use SQL tools such as PyHive here and turn this into a "two source" problem (one source from abstract SQL, the other from the plain filesystem), even though that would be simple.
   So at present I have to use pd.read_csv on the file object returned by fs.open, and retrieve the schema info from the TBLS table in MySQL, where the Hive metastore actually keeps the detailed schema. I think this design is not ideal.
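   Concretely, my current workaround looks roughly like this (a sketch using the legacy pyarrow.hdfs interface; the path, column names and namenode details are placeholders, and the column list is copied by hand from the metastore):

```python
import pandas as pd
import pyarrow as pa

fs = pa.hdfs.connect("namenode-host", 8020)

# column names copied manually from the Hive metastore tables in MySQL
columns = ["id", "name", "price"]  # placeholder schema

# Hive's default text format uses the \x01 (Ctrl-A) field delimiter
with fs.open("/user/hive/warehouse/my_db.db/my_table/000000_0") as f:
    df = pd.read_csv(f, sep="\x01", names=columns, header=None)
```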
   So what I want to know is: did I miss some details of pyarrow's underlying logic related to the HDFS filesystem and Hive? Please clarify this for me.
   All of this is about pyarrow's internal construction, not about other frameworks.
   I would also like a brief introduction to the dataset API's functionality for Hive's Parquet files and text files. Can you give me some examples of them, mainly for the text storage format in Hive's HDFS?
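   To make the question concrete, this is the kind of usage I have in mind based on my reading of the dataset API docs (a sketch; the host, paths and partition column are made up, and I am not sure the text-format part is even the intended approach):

```python
import pyarrow.csv as pcsv
import pyarrow.dataset as ds
import pyarrow.fs

fs = pyarrow.fs.HadoopFileSystem("namenode-host", port=8020)

# Parquet-backed Hive table with hive-style partition directories (dt=.../)
parquet_ds = ds.dataset(
    "/user/hive/warehouse/my_db.db/parquet_table",
    filesystem=fs,
    format="parquet",
    partitioning="hive",
)
table = parquet_ds.to_table(filter=ds.field("dt") == "2020-12-01")

# text-format Hive table: I can set the Ctrl-A delimiter by hand, but I do not
# see how to tell the dataset that the files have no header row and what the
# column names/types are -- that is exactly the gap I am asking about
text_format = ds.CsvFileFormat(parse_options=pcsv.ParseOptions(delimiter="\x01"))
text_ds = ds.dataset(
    "/user/hive/warehouse/my_db.db/text_table",
    filesystem=fs,
    format=text_format,
    partitioning="hive",
)
```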
   I also took a look at a data transport toolkit called Sqoop; in its AppendUtils.java it uses some detailed partition-manipulation helpers to perform data appends, and I think all of those functions could be rebuilt with pyarrow. But as I reviewed the pyarrow source code, I could not find any developed logic for "partition" and "warehouse" manipulation. Has anyone built a project with pyarrow or Arrow's other APIs that implements these functions?
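   For example, the "append new rows into a partitioned warehouse directory" operation I have in mind would look something like this with the dataset API, if I understand it correctly (a sketch; the table contents, base path and partition column are made up):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs

fs = pyarrow.fs.HadoopFileSystem("namenode-host", port=8020)

# new rows to append, with the partition column included
new_rows = pa.table({
    "id": [1, 2],
    "name": ["a", "b"],
    "dt": ["2020-12-01", "2020-12-01"],
})

# write into hive-style partition directories (dt=2020-12-01/...);
# a distinct basename_template per run should keep new files from
# clobbering the ones already in the partition
ds.write_dataset(
    new_rows,
    "/user/hive/warehouse/my_db.db/parquet_table",
    filesystem=fs,
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("dt", pa.string())]), flavor="hive"),
    basename_template="append-20201201-{i}.parquet",
)
```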

