[jira] [Comment Edited] (PHOENIX-1609) MR job to populate index tables

James Taylor (JIRA) Mon, 16 Feb 2015 13:36:06 -0800

    [ https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323330#comment-14323330 ]


James Taylor edited comment on PHOENIX-1609 at 2/16/15 9:34 PM:
----------------------------------------------------------------

Thanks for the patch, [[email protected]]. Here's some feedback:
- I think we should aim to build this more directly on top of the MR support 
you already built, in particular on the ability to run a SELECT query through 
PhoenixInputFormat. The main reason is that with functional indexes (see 
http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes), 
arbitrary expressions may be used to define the index, which would fit in nicely 
with the mechanism you've already built. Probably the approach that'll give you 
the most bang-for-the-buck would be to expand your MR integration first to 
support *writing* the results from the SELECT to create an HFile (much like the 
CSV loader).
- Once you can write to a table through our MR support, take a look at the 
UPSERT SELECT statement created by PostIndexDDLCompiler to populate an index. 
The SELECT part of this is what you'd want to build as your select statement, 
while the UPSERT part defines the columns to which you're writing. It's 
possible that the building of this statement could be exposed through a shared 
utility (or that you could just use PostIndexDDLCompiler for this work too). If 
you get the QueryPlan for this SELECT statement, you should, in theory, be able 
to run it through your existing MR support (which gets you most of the way 
there).
- I think we should strive to hide the MR job behind our existing CREATE INDEX 
statement. I think you can decide in PostIndexDDLCompiler.compile() whether to 
run the index creation through MR or through our existing mechanism, based on 
the table stats you can retrieve from the data table. At that point you'll 
already have the SELECT statement and UPSERT statement built, so it's just a 
matter of how they'll be run. Something like this:
{code}
    PTableStats stats = dataTableRef.getTable().getTableStats();
    Collection<GuidePostsInfo> guidePostsCollection = stats.getGuidePosts().values();
    long totalByteSize = 0;
    for (GuidePostsInfo info : guidePostsCollection) {
        totalByteSize += info.getByteCount();
    }
    long byteThreshold = connection.unwrap(PhoenixConnection.class).getQueryServices().getProps()
        .getLong(QueryServices.MAP_REDUCE_INDEX_BUILD_THRESHOLD_ATTRIB,
            QueryServicesOptions.DEFAULT_MAP_REDUCE_INDEX_BUILD_THRESHOLD);
    if (totalByteSize >= byteThreshold) {
        // Return new MutationPlan that has an execute() method that kicks off the map/reduce job
    } else {
        // Return MutationPlan as it is created today
    }
{code}
- As far as setting the index state appropriately, you shouldn't need to do 
anything to initialize the state, as the CREATE INDEX call already sets the 
index state to PIndexState.BUILDING at the beginning, from createTableInternal. 
Then on the successful completion of your MR job, you'd set the index 
state to PIndexState.ACTIVE. It's likely we'll want to move the code that does 
this now in MetaDataClient.buildIndex() into the end of each MutationPlan 
generated there (instead of assuming that the index build always happens 
synchronously).
- Minor, but when validating that a data/index table exists, go through our 
metadata operations using connection.getMetaData() and the corresponding JDBC 
APIs for DatabaseMetaData, instead of dipping down to our internal PTable APIs 
as you've done here:
{code}
+    private boolean isValidIndexTable(final Connection connection, final String masterTable, final String indexTable) throws SQLException {
+        final PTable table = PhoenixRuntime.getTable(connection, masterTable);
+        for(PTable indxTable : table.getIndexes()){
+            if(indxTable.getTableName().getString().equalsIgnoreCase(indexTable)) {
+                return true;
+            }
+        }
+        return false;
+        
+    }
+    
{code}
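
For illustration, the JDBC-based check suggested in the last point might look roughly like the sketch below. It relies only on the standard DatabaseMetaData.getIndexInfo() call and its INDEX_NAME result column. The class name IndexTableValidator and the schemaName parameter are made up for this example, and it assumes the driver reports Phoenix index tables through getIndexInfo().

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper (not part of the patch): validates an index table
// through JDBC metadata instead of the internal PTable APIs.
public final class IndexTableValidator {

    private IndexTableValidator() {}

    /**
     * Returns true if indexTable is reported as an index on dataTable.
     * schemaName may be null, in which case the schema is not used to
     * narrow the search.
     */
    public static boolean isValidIndexTable(Connection connection, String schemaName,
            String dataTable, String indexTable) throws SQLException {
        DatabaseMetaData metaData = connection.getMetaData();
        // unique=false includes non-unique indexes; approximate=true allows
        // the driver to return possibly stale statistics.
        try (ResultSet rs = metaData.getIndexInfo(null, schemaName, dataTable, false, true)) {
            while (rs.next()) {
                String name = rs.getString("INDEX_NAME");
                if (name != null && name.equalsIgnoreCase(indexTable)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Unquoted Phoenix identifiers are normalized to upper case, so the data table name passed in would probably need the same normalization before the lookup.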

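The index-state handling described above (leave the index in BUILDING while the job runs, flip it to ACTIVE only after the build succeeds) could be sketched as follows. Everything here is a stand-in for illustration: IndexBuildRunner, IndexStateStore and the local IndexState enum are hypothetical and only mirror the PIndexState values named in the comment. In Phoenix the transition would presumably live at the end of the MutationPlan.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: run an index build (synchronous or MR-backed) and
// mark the index ACTIVE only when the build reports success.
public final class IndexBuildRunner {

    /** Stand-in for the two PIndexState values mentioned in the comment. */
    public enum IndexState { BUILDING, ACTIVE }

    /** Hypothetical hook for persisting the index state. */
    public interface IndexStateStore {
        void setState(String indexName, IndexState state) throws Exception;
    }

    private IndexBuildRunner() {}

    /**
     * Runs buildJob and, if it returns true, moves the index to ACTIVE.
     * On failure the state is left untouched, i.e. still BUILDING as set
     * when the index was created.
     */
    public static boolean buildAndActivate(String indexName, Callable<Boolean> buildJob,
            IndexStateStore store) throws Exception {
        boolean succeeded = Boolean.TRUE.equals(buildJob.call());
        if (succeeded) {
            store.setState(indexName, IndexState.ACTIVE);
        }
        return succeeded;
    }
}
```

Keeping the transition next to the job's completion means it no longer matters whether the build ran synchronously or as a map/reduce job.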

> MR job to populate index tables 
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables long after the data 
> exists in those tables. It would be good to have a simple MR job, provided 
> by Phoenix, that users can run to bring indexes in sync with the 
> master table. 
> Users can invoke the MR job using the following command: 
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt 
> INDEX_TABLE -columns a,b,c
> Is this ideal? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
