[jira] [Commented] (DRILL-5365) FileNotFoundException when reading a parquet file

ASF GitHub Bot (JIRA) Thu, 31 May 2018 17:51:20 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497412#comment-16497412
 ]


ASF GitHub Bot commented on DRILL-5365:
---------------------------------------

paul-rogers commented on issue #1296: DRILL-5365: Prevent plugin config from 
changing default fs. Make DrillFileSystem Immutable.
URL: https://github.com/apache/drill/pull/1296#issuecomment-393726044
 
 
   @ilooner, the fix will probably work, but it seems a bit of a hack. It is not 
clear why Hive changes the default file system, though we can speculate. Drill 
likely uses Hive as a metastore. (Later versions of Hive split out the 
metastore into Hive Meta Store or HMS, so, ideally, that's what Drill would 
call it...)
   
   It is unfortunate that Drill requires users to copy their Hive config from a 
file into a Drill JSON config object. Two copies, in two formats, is typically 
frowned upon.
   
   So, we have HMS, describing data on disk. Presumably HMS wants to state the 
file system that contains the file, and does so as part of its config. 
   
   Now, we want readers to read those files. One would expect the "hive" 
storage plugin to replace Drill's format plugin mechanisms with its own 
file-to-format mapping. When reading HMS files, Drill would use the "hive" 
options and formats. Is that how it works? Not sure.
   
   I believe that "hive" has its own set of readers. I've seen indications that 
one cannot use, say, the Drill native CSV or Parquet readers for "hive" 
files. The question is, how is all this wired together? (I don't know; I 
haven't looked at this code.)
   
   Presumably, if we query a "hive" table, we need to use the "hive" file 
system info.
   
   Where does this "hive" info leak into a non-"hive" reader? Perhaps joining a 
"hive" file with a "dfs" file?
   
   With this fix, would that join work?
   
   In short, I think we need to understand the above to ensure we don't 
end up playing Whack-a-Mole, introducing a new bug while fixing another.
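   
   To make the suspected leak concrete, here is a minimal sketch of one way a 
plugin-level "fs.default.name" could change how a non-"hive" reader resolves 
paths. Everything in it is an assumption for illustration: it uses plain Java 
maps rather than Hadoop's Configuration, the class and method names are 
hypothetical, and the "hdfs://namenode:8020" address is made up. It is not 
Drill's code, and I have not verified that this is how the real leak happens.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model only: plain maps stand in for a shared file system config.
public class DefaultFsLeakSketch {
    static final String DEFAULT_FS_KEY = "fs.default.name";

    /** Qualifies a path against the default file system named in conf. */
    static String qualify(Map<String, String> conf, String path) {
        if (path.contains("://")) {
            return path; // the path already names its file system
        }
        String fs = conf.get(DEFAULT_FS_KEY);
        // Drop one trailing slash so "file:///" + "/data" yields "file:///data".
        String base = fs.endsWith("/") ? fs.substring(0, fs.length() - 1) : fs;
        return base + path;
    }

    /** Assumed failure mode: the plugin writes its overrides into a shared config. */
    static void applyPluginProps(Map<String, String> shared, Map<String, String> pluginProps) {
        shared.putAll(pluginProps);
    }

    /** Alternative: the plugin gets a read-only private copy; the shared config is untouched. */
    static Map<String, String> pluginView(Map<String, String> shared, Map<String, String> pluginProps) {
        Map<String, String> copy = new HashMap<>(shared);
        copy.putAll(pluginProps);
        return Collections.unmodifiableMap(copy);
    }

    public static void main(String[] args) {
        Map<String, String> hiveProps = Collections.singletonMap(DEFAULT_FS_KEY, "file:///");

        // Shared, mutable config: the "hive" override changes what a "dfs" reader sees.
        Map<String, String> shared = new HashMap<>();
        shared.put(DEFAULT_FS_KEY, "hdfs://namenode:8020");
        applyPluginProps(shared, hiveProps);
        System.out.println(qualify(shared, "/data/t.parquet")); // file:///data/t.parquet

        // Isolated copy: the "dfs" reader still resolves against the cluster default.
        Map<String, String> cluster = new HashMap<>();
        cluster.put(DEFAULT_FS_KEY, "hdfs://namenode:8020");
        Map<String, String> hiveView = pluginView(cluster, hiveProps);
        System.out.println(qualify(cluster, "/data/t.parquet"));  // hdfs://namenode:8020/data/t.parquet
        System.out.println(qualify(hiveView, "/data/t.parquet")); // file:///data/t.parquet
    }
}
```

   If something like the first case is what happens, an unqualified path would 
resolve to each node's local disk, which would be consistent with a file 
written on node A not being found from node B. The second case is the kind of 
isolation I would expect an immutable file system object to give us, but it 
leaves open which view a join across "hive" and "dfs" should use.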

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> FileNotFoundException when reading a parquet file
> -------------------------------------------------
>
>                 Key: DRILL-5365
>                 URL: https://issues.apache.org/jira/browse/DRILL-5365
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.10.0
>            Reporter: Chun Chang
>            Assignee: Timothy Farkas
>            Priority: Major
>             Fix For: 1.14.0
>
>
> The parquet file is generated through the following CTAS.
> To reproduce the issue:
> 1) set up a cluster with two or more nodes;
> 2) enable impersonation;
> 3) set "fs.default.name": "file:///" in the hive storage plugin;
> 4) restart the drillbits;
> 5) as a regular user, on node A, drop the table/file;
> 6) run a CTAS from a large enough hive table as source to recreate the 
> table/file;
> 7) query the table from node A, which should work;
> 8) query the table from node B as the same user, which should reproduce the 
> issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
