Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Paul Rogers Fri, 24 Aug 2018 17:11:59 -0700

Hi Tim,

Can't recall the details on this. The phrase "the filesystem configuration" 
might be misleading. When executing, Drill must support multiple filesystems. I 
can have two different DFS configs, pointing to two different HDFS clusters 
(say) in a single query:


SELECT ... FROM dfs1.`aFile.csv`, dfs2.`anotherFile.csv`

We'd create separate readers for each file. Each reader should have a different 
filesystem conf: the one appropriate for the storage plugin config used for 
that file.

Using that as a reference, it would seem that Hive plugin queries use the Hive 
filesystem config, while any DFS tables in the same query use the DFS config.
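
To illustrate the idea of one filesystem conf per reader, here is a minimal sketch in Python rather than Drill's Java. It is not Drill code: the dictionary PLUGIN_FS_CONF, the functions fs_conf_for_table and plan_readers, and the cluster URIs are all hypothetical names invented for the example. It only shows that each table reference resolves to the conf of the storage plugin it names.

```python
# Hypothetical model (not Drill's API): each storage plugin config
# carries its own filesystem settings. The URIs are made up.
PLUGIN_FS_CONF = {
    "dfs1": {"fs.defaultFS": "hdfs://cluster-a:8020"},
    "dfs2": {"fs.defaultFS": "hdfs://cluster-b:8020"},
}


def fs_conf_for_table(table_ref):
    """Return the filesystem conf of the plugin named in a `plugin.table` reference."""
    # Everything before the first dot is taken to be the plugin name.
    plugin_name, _, _ = table_ref.partition(".")
    return PLUGIN_FS_CONF[plugin_name]


def plan_readers(table_refs):
    """Pair every table with the conf of its own plugin, one reader per table."""
    return [(ref, fs_conf_for_table(ref)) for ref in table_refs]


readers = plan_readers(["dfs1.`aFile.csv`", "dfs2.`anotherFile.csv`"])
for ref, conf in readers:
    print(ref, "->", conf["fs.defaultFS"])
```

Under this model a Hive table would simply be one more entry keyed by the Hive plugin's name, with its own filesystem settings.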

I wonder, based on your comment, is this not happening? Are the configs getting 
muddled somehow?

Thanks,
- Paul

 

    On Friday, August 24, 2018, 3:45:08 PM PDT, Timothy Farkas 
<[email protected]> wrote:  
 
 Hi Paul / Vitalii

Thanks for the info. I was asking about this because of
https://issues.apache.org/jira/browse/DRILL-6609 in which some strange
behavior was observed if the user defined fs.default.name in the HivePlugin
config. I also saw that the filesystem specified in the HivePlugin config
influences the FileSystem used for native scans. This happens because in
HiveDrillNativeParquetRowGroupScan.getFsConf we use the HiveStoragePlugin
to create the filesystem configuration, which is then used by
DrillFileSystem.
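
A rough sketch of the effect described above, in Python rather than Drill's Java. This is not the body of getFsConf: the function build_fs_conf, the property values, and the assumption that plugin properties are simply layered over a base configuration are all hypothetical, chosen only to show how a filesystem setting in the Hive plugin config could end up governing the native scan.

```python
def build_fs_conf(base_conf, plugin_props):
    """Layer plugin properties over a base conf; plugin values win on conflict."""
    conf = dict(base_conf)     # copy so the base conf is left untouched
    conf.update(plugin_props)  # plugin-level settings override the defaults
    return conf


# Hypothetical values standing in for core-site.xml defaults.
core_site = {"fs.defaultFS": "hdfs://default-cluster:8020"}

# Hypothetical Hive plugin properties that also set the filesystem.
hive_plugin_props = {
    "hive.metastore.uris": "thrift://metastore:9083",
    "fs.defaultFS": "hdfs://hive-cluster:8020",
}

native_scan_conf = build_fs_conf(core_site, hive_plugin_props)
print(native_scan_conf["fs.defaultFS"])
```

If the plugin config leaves the filesystem property out, the base value survives unchanged, which matches the advice that the property only needs to be set when the core-site.xml default is not acceptable.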

However, based on your feedback it looks like this is desirable behavior,
since the user may want to define a different filesystem for the HivePlugin
along with different format plugins. That means the root cause of
https://issues.apache.org/jira/browse/DRILL-6609 is something else.
I'll probably abandon that issue at this point since it's not reproducible
and I have no further leads as to what could cause it.

Thanks,
Tim

On Thu, Aug 23, 2018 at 2:46 AM, Vitalii Diravka <[email protected]>
wrote:

> Hi Tim,
>
> Some comments from me.
>
> *HiveStoragePlugin*
> *fs.defaultFS* is a Hive-specific property. This is the URI used by the Hive
> Metastore to point to where tables are placed. There is no need to specify
> this property if the default value from *core-site.xml* is acceptable; see
> more:
> https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-common/core-default.xml
>
> *Hive Native readers.*
> Currently Drill has two Hive Native readers: Parquet and MapR Json. Both of
> them use the appropriate default File Format Plugins. This is a limitation:
> for now there is no way to change the FormatPlugins config for them.
> There is a Jira ticket for it:
> https://issues.apache.org/jira/browse/DRILL-6621
>
>
> Kind regards
> Vitalii
>
>
> On Thu, Aug 23, 2018 at 3:02 AM Paul Rogers <[email protected]>
> wrote:
>
> > Hi Tim,
> >
> > I don't have an answer. But, I can point out some factors to consider.
> >
> > Hive describes a set of data in a specific file system. Would make sense
> > to associate that file system with the Hive configuration. Else, I could
> > use a Hive metastore for FS A, with a DFS configured for FS B, and have
> > nothing work for reasons that would be hard to figure out.
> >
> > Further, isn't Hive its own storage plugin, and thus would be referenced
> > as, say, "myHive.customers"? What would be the implied relationship
> > between the Hive plugin config and the DFS plugin config?
> >
> > Suppose I had two Hive plugin configs, Hive1 and Hive2. And, two DFS
> > configs: DFS1 and DFS2. What is the implied relationship (if any) between
> > Hive1 and either DFS1 or DFS2? Between Hive2 and DFS1 or DFS2?
> >
> > These ambiguities would seem to explain why Hive's HDFS URL is
> > configured with Hive and is distinct from a similar HDFS URL defined
> > for DFS.
> >
> > Can you suggest a way to avoid duplication and link the two? Perhaps, in
> > Hive config, name a DFS config rather than duplicating the HDFS config
> > for Hive?
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Wednesday, August 22, 2018, 4:41:37 PM PDT, Timothy Farkas <
> > [email protected]> wrote:
> >
> >  Hi All,
> >
> > I'm a bit confused and I was hoping to get some clarification about how
> > the HiveStoragePlugin interacts with the FileSystem plugin. Currently the
> > HiveStoragePlugin allows the user to configure their own value for
> > fs.defaultFS in the plugin properties, which overrides the defaultFS used
> > when doing a native parquet scan for Hive. Is this intentional? Also what
> > is the high level theory about how Hive and the FileSystem plugins
> > interact? Specifically does Drill support querying Hive when Hive is
> > using a different FileSystem than the one specified in the file system
> > plugin? Or does Drill assume that Hive is using the same FileSystem as
> > the one defined in the Drill FileSystem plugin?
> >
> > Thanks,
> > Tim
> >
>
