[ https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324812#comment-14324812 ]

Gabriel Reid commented on PHOENIX-1609:
---------------------------------------

Good points [~jamestaylor]. I think the main difference (which may just be an 
artificial one) is that MR jobs are typically started via the hadoop command, 
and the hadoop command is already configured to submit jobs.

In terms of making sure that a Phoenix client is configured to run an MR job, 
the reliable way to make it work is to ensure either that the system's 
mapred-site.xml (and likely core-site.xml) is on the classpath, or that the 
relevant contents of those files (i.e. where to find the jobtracker or YARN 
resourcemanager, and probably where to find the namenode) are present in the 
Configuration object used to launch the job (setting up this classpath is 
basically all the "hadoop jar" command does). We'd also have to look into 
exactly which dependencies need to be included in Phoenix to kick off the job. 
Most of them are probably already there, but there are likely some 
job-submission dependencies that would need to be added.
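For illustration, the relevant client-side settings would look something like 
the following (the host names and ports are placeholders, and the exact set of 
properties needed may vary by Hadoop version):

```xml
<!-- Minimal client-side settings for submitting an MR job to YARN.
     Host names and ports below are placeholders, not real values. -->
<configuration>
  <!-- Where to find the namenode (normally in core-site.xml) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <!-- Submit jobs via YARN rather than running locally
       (normally in mapred-site.xml) -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Where to find the YARN resourcemanager
       (normally in yarn-site.xml) -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager.example.com:8032</value>
  </property>
</configuration>
```

Without these (or equivalent Configuration.set calls before launching the 
job), the client would fall back to the local job runner rather than submitting 
to the cluster.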

I can definitely see how this would be better for users if it works. My main 
concern is that there's a good chance it won't work by default (i.e. it will 
always require configuring things for submitting MR jobs). Relatedly, it's a 
lot easier to debug configuration issues when people have access to the hadoop 
command on their system (and are using it to start jobs) than to debug 
job-submission issues when the submission happens inside another system (e.g. 
a JDBC driver).

Another idea that might make things easier for general use, although it would 
require some extra setup, would be to store the relevant job-submission 
configuration in ZooKeeper somewhere, and retrieve it from ZK when submitting 
jobs instead of expecting it to be in a configuration file. This would 
obviously require getting that information there in the first place, but it 
would then allow someone who knows only the Phoenix JDBC URL to create indexes 
and kick off the relevant MR job.
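As a rough sketch of the idea (the znode path and payload layout here are 
purely hypothetical; nothing like this exists in Phoenix today):

```
# Hypothetical znode, written once at cluster-setup time:
/phoenix/mr-submit-config
    fs.defaultFS=hdfs://namenode.example.com:8020
    mapreduce.framework.name=yarn
    yarn.resourcemanager.address=resourcemanager.example.com:8032
```

A client already knows the ZK quorum from the Phoenix JDBC URL, so it could 
read this znode and merge the properties into its Configuration object before 
submitting the job, with no local Hadoop config files required.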

> MR job to populate index tables 
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables way after the data 
> exists on the master tables.  It would be good to have a simple MR job given 
> by the phoenix code that users can call to have indexes in sync with the 
> master table. 
> Users can invoke the MR job using the following command 
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt 
> INDEX_TABLE -columns a,b,c
> Is this ideal? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
