[
https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323330#comment-14323330
]
James Taylor edited comment on PHOENIX-1609 at 2/16/15 9:34 PM:
----------------------------------------------------------------
Thanks for the patch, [[email protected]]. Here's some feedback:
- I think we should aim to build this more directly on top of the MR support
you already built, in particular on the ability to run a SELECT query through
PhoenixInputFormat. The main reason is that with functional indexes (see
http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes),
arbitrary expressions may be used to define the index, which fits nicely
with the mechanism you've already built. The approach that'll probably give you
the most bang for the buck is to first expand your MR integration to
support *writing* the results of the SELECT to an HFile (much like the
CSV loader).
- Once you can write to a table through our MR support, take a look at the
UPSERT SELECT statement created by PostIndexDDLCompiler to populate an index.
The SELECT part of this is what you'd want to build as your select statement,
while the UPSERT part defines the columns to which you're writing. It's
possible that the building of this statement could be exposed through a shared
utility (or that you could just use PostIndexDDLCompiler for this work too). If
you get the QueryPlan for this SELECT statement, you should, in theory, be able
to run it through your existing MR support (which gets you most of the way
there).
- I think we should strive to hide the MR job behind our existing CREATE INDEX
statement. I think you can decide in PostIndexDDLCompiler.compile() whether
to run the index creation through MR or through our existing mechanism,
based on the table stats you can retrieve from the data table. In fact, at that
point you'll already have the SELECT and UPSERT statements built, so it's
just a matter of how they're run. Something like this:
{code}
PTableStats stats = dataTableRef.getTable().getTableStats();
Collection<GuidePostsInfo> guidePostsCollection = stats.getGuidePosts().values();
long totalByteSize = 0;
for (GuidePostsInfo info : guidePostsCollection) {
    totalByteSize += info.getByteCount();
}
long byteThreshold = connection.unwrap(PhoenixConnection.class).getQueryServices().getProps()
        .getLong(QueryServices.MAP_REDUCE_INDEX_BUILD_THRESHOLD_ATTRIB,
                 QueryServicesOptions.DEFAULT_MAP_REDUCE_INDEX_BUILD_THRESHOLD);
if (totalByteSize >= byteThreshold) {
    // Return new MutationPlan that has an execute() method that kicks off
    // the map/reduce job
} else {
    // Return MutationPlan as it is created today
}
{code}
- As far as setting the index state appropriately, you shouldn't need to do
anything to initialize the state, as the CREATE INDEX call already sets the
index state to PIndexState.BUILDING at the beginning, from createTableInternal.
Then on the successful completion of your MR job, you'd set the index
state to PIndexState.ACTIVE. It's likely we'll want to move the code that does
this now in MetaDataClient.buildIndex() into the end of each MutationPlan
generated there (instead of assuming that the index build always happens
synchronously).
- Minor, but when validating that a data/index table exists, go through our
metadata operations using connection.getMetaData() and the corresponding JDBC
APIs for DatabaseMetaData, instead of dipping down to our internal PTable APIs
as you've done here:
{code}
+    private boolean isValidIndexTable(final Connection connection, final String masterTable, final String indexTable) throws SQLException {
+        final PTable table = PhoenixRuntime.getTable(connection, masterTable);
+        for(PTable indxTable : table.getIndexes()){
+            if(indxTable.getTableName().getString().equalsIgnoreCase(indexTable)) {
+                return true;
+            }
+        }
+        return false;
+
+    }
+
{code}
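For the last point, a minimal sketch of the same check written against the standard JDBC metadata API might look like the following. It assumes the driver reports secondary indexes through DatabaseMetaData.getIndexInfo() and that the caller passes table names in the case the catalog stores them. The schemaName parameter and the IndexValidation class name are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

final class IndexValidation {
    /**
     * Returns true if indexTable is reported as an index of dataTable,
     * using only the JDBC DatabaseMetaData API (no internal PTable access).
     */
    public static boolean isValidIndexTable(Connection connection, String schemaName,
            String dataTable, String indexTable) throws SQLException {
        DatabaseMetaData metaData = connection.getMetaData();
        // catalog = null: do not filter by catalog.
        // unique = false: list all indexes, not only unique ones.
        // approximate = false: ask for accurate rather than estimated data.
        try (ResultSet rs = metaData.getIndexInfo(null, schemaName, dataTable, false, false)) {
            while (rs.next()) {
                String name = rs.getString("INDEX_NAME");
                if (name != null && name.equalsIgnoreCase(indexTable)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

getIndexInfo() returns one row per indexed column, so the same INDEX_NAME can show up several times; the loop simply stops at the first match.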
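For the index state handling, one way to picture moving the state update to the end of each plan is a small wrapper like the one below. BuildPlan, IndexStateUpdater, and IndexState are hypothetical stand-ins rather than Phoenix types. The point is only that the ACTIVE transition runs after execute() returns, whether the build ran synchronously or as a blocking MR job.

```java
final class IndexBuildSketch {
    /** Stand-in for the two PIndexState values discussed above; not the Phoenix enum. */
    enum IndexState { BUILDING, ACTIVE }

    /** Hypothetical stand-in for a MutationPlan that populates an index. */
    interface BuildPlan {
        long execute() throws Exception; // returns the number of rows written
    }

    /** Hypothetical hook for whatever code updates the index state in the catalog. */
    interface IndexStateUpdater {
        void setState(IndexState state) throws Exception;
    }

    /** Wraps a plan so the index is marked ACTIVE only after the build succeeds. */
    public static BuildPlan activateOnSuccess(BuildPlan delegate, IndexStateUpdater updater) {
        return () -> {
            // Either the existing UPSERT SELECT path or a blocking MR job.
            long rows = delegate.execute();
            // Not reached if execute() throws, so a failed build stays in BUILDING.
            updater.setState(IndexState.ACTIVE);
            return rows;
        };
    }
}
```

With this shape, the code that picks between the MR plan and the existing plan does not need to know anything about state transitions; it just wraps whichever plan it picked.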
> MR job to populate index tables
> --------------------------------
>
> Key: PHOENIX-1609
> URL: https://issues.apache.org/jira/browse/PHOENIX-1609
> Project: Phoenix
> Issue Type: New Feature
> Reporter: maghamravikiran
> Assignee: maghamravikiran
> Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables long after the data
> already exists in them. It would be good to have a simple MR job, provided
> by Phoenix, that users can call to bring indexes in sync with the
> master table.
> Users can invoke the MR job using the following command
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt
> INDEX_TABLE -columns a,b,c
> Is this ideal?