Josh Mahonin commented on PHOENIX-4490:

FWIW, I think there should be a more elegant solution here. It would be nice if 
these sorts of parameters could be passed in as options to the DataFrame / 
Dataset builder, and then carried forward as needed.
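
To make that concrete, here is a rough sketch of what "carrying options forward" 
could look like. It assumes the options arrive as the Map[String, String] that 
Spark hands to a data source's createRelation(). The OptionOverlay object and 
its overlay method are hypothetical names, not existing phoenix-spark API; only 
the Hadoop Configuration calls (copy constructor, set) are real.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper, not part of phoenix-spark: overlays user-supplied
// options on top of a base Hadoop Configuration.
object OptionOverlay {
  def overlay(base: Configuration, options: Map[String, String]): Configuration = {
    // Copy first so the caller's Configuration is left untouched.
    val merged = new Configuration(base)
    // User-supplied values win over whatever the base already holds.
    options.foreach { case (key, value) => merged.set(key, value) }
    merged
  }
}
```

The merged object could then be handed to PhoenixRDD in place of a fresh 
new Configuration(). Whether every option should be copied, or only a subset, 
is a design question this sketch does not try to answer.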

As I recall, the Configuration object itself is _not_ Serializable, which is a 
big challenge for Spark, and why it gets re-created several times within the 
phoenix-spark module. Perhaps there's another solution for that problem we 
could leverage?
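
One possible direction, sketched loosely: Configuration implements Hadoop's 
Writable interface, so a thin Serializable wrapper can delegate to write() and 
readFields() from Java's serialization hooks. The SerializableConf class name 
is made up for illustration and is not something phoenix-spark ships today; 
treat this as an untested idea rather than a patch.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Hypothetical wrapper: lets a Configuration travel inside a Spark closure
// by serializing it through its own Writable methods.
class SerializableConf(@transient var value: Configuration) extends Serializable {

  // Called by Java serialization on the sending side.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out) // ObjectOutputStream is a DataOutput
  }

  // Called by Java serialization on the receiving side.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false) // start empty, skip classpath defaults
    value.readFields(in)
  }
}
```

With something like this, the user's settings would be captured once on the 
driver and read back on the executors, instead of starting over from 
new Configuration() each time.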

Glad there's a workaround, but if anyone has time for a patch to the underlying 
issue, that would be fantastic!

> Phoenix Spark Module doesn't pass in user properties to create connection
> -------------------------------------------------------------------------
>                 Key: PHOENIX-4490
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4490
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Karan Mehta
>            Priority: Major
> The Phoenix Spark module doesn't work properly in a Kerberos environment. This 
> is because whenever a new {{PhoenixRDD}} is built, it is always built with 
> new, default properties. The following piece of code in 
> {{PhoenixRelation}} is an example. This is the class used by Spark to create 
> a {{BaseRelation}} before executing a scan. 
> {code}
>     new PhoenixRDD(
>       sqlContext.sparkContext,
>       tableName,
>       requiredColumns,
>       Some(buildFilter(filters)),
>       Some(zkUrl),
>       new Configuration(),
>       dateAsTimestamp
>     ).toDataFrame(sqlContext).rdd
> {code}
> This would work fine in most cases if the Spark code is being run on the same 
> cluster as HBase, since the config object will pick up properties from the 
> classpath XML files. However, in an external environment we should use the 
> user-provided properties and merge them before creating any 
> {{PhoenixRelation}} or {{PhoenixRDD}}. As per my understanding, we should 
> ideally provide properties in the {{DefaultSource#createRelation()}} method.
> An example of when this fails: Spark tries to get the splits to optimize 
> the MR performance for loading data in the table in the 
> {{PhoenixInputFormat#generateSplits()}} method. Ideally, it should get all 
> the config parameters from the {{JobContext}} being passed, but it 
> defaults to {{new Configuration()}}, irrespective of what the user passes in. 
> Thus it fails to create a connection.
> [~jmahonin] [~maghamraviki...@gmail.com] 
> Any ideas or advice? Let me know if I am missing anything obvious here.

This message was sent by Atlassian JIRA