[ 
https://issues.apache.org/jira/browse/PHOENIX-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930690#comment-16930690
 ] 

Kadir OZDEMIR commented on PHOENIX-5362:
----------------------------------------

[~gjacoby] and [~ckulkarni], have you noticed that PHOENIX-5018 uses 
PhoenixServerBuildIndexInputFormat, which generates the plan once in 
getQueryPlan() and returns the same plan on subsequent calls to 
getQueryPlan()?

public class PhoenixServerBuildIndexInputFormat<T extends DBWritable> extends PhoenixInputFormat {
    QueryPlan queryPlan = null;

    ...

    @Override
    protected QueryPlan getQueryPlan(final JobContext context, final Configuration configuration)
            throws IOException {
        Preconditions.checkNotNull(context);
        if (queryPlan != null) {
            return queryPlan;
        }

        ...
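
For readers who want to try the pattern in isolation, here is a minimal, self-contained sketch of "build the plan once, then return the cached one". It is not Phoenix code: Plan, CachingInputFormat and the planBuilds counter are hypothetical stand-ins invented for illustration, and the real getQueryPlan() takes a JobContext and a Configuration rather than a String.

```java
public class CachedPlanSketch {

    /** Hypothetical stand-in for Phoenix's QueryPlan; not the real class. */
    static final class Plan {
        final String description;

        Plan(String description) {
            this.description = description;
        }
    }

    /** Hypothetical input format that memoizes its plan in an instance field. */
    static class CachingInputFormat {
        private Plan plan = null;
        int planBuilds = 0; // counts how often the (assumed expensive) planning step ran

        Plan getQueryPlan(String selectStatement) {
            if (selectStatement == null) {
                throw new NullPointerException("selectStatement");
            }
            if (plan != null) {
                return plan; // later calls on this instance reuse the first plan
            }
            planBuilds++;
            plan = new Plan("plan for: " + selectStatement);
            return plan;
        }
    }

    public static void main(String[] args) {
        CachingInputFormat format = new CachingInputFormat();
        Plan first = format.getQueryPlan("SELECT * FROM T");
        Plan second = format.getQueryPlan("SELECT * FROM T");
        System.out.println("same plan object: " + (first == second));
        System.out.println("plans built: " + format.planBuilds);
    }
}
```

In this sketch the cache lives in an instance field, so it only covers repeated calls on the same object; a newly constructed CachingInputFormat starts with an empty field and builds its own plan.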

> Mappers should use the queryPlan from the driver rather than regenerating the 
> plan
> ----------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5362
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5362
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Chinmay Kulkarni
>            Priority: Major
>             Fix For: 4.15.1, 5.1.1
>
>
> Currently, PhoenixInputFormat#getQueryPlan already generates a queryPlan, and 
> we use this plan to get the scans and splits for the MR job. In 
> PhoenixInputFormat#createRecordReader, which is called inside each mapper, we 
> create a queryPlan again and pass it to the PhoenixRecordReader instance.
> There are multiple problems with this approach:
> # The mappers already have information about the scans from the driver code. 
> We potentially just need to wrap these scans in an iterator and then create a 
> ResultSet from it.
> # The mappers don't need most of the information embedded within a queryPlan, 
> so they shouldn't need to regenerate the plan.
> # There are odd corner cases that can occur if we replan the query in each 
> mapper, for example if an index is created or metadata changes between the 
> time the MR job is created and the time the mappers actually launch. In this 
> case, the mappers have the scans created for the first queryPlan, but they 
> use iterators created for the second queryPlan. The issued scans would then 
> not match the queryPlan embedded in the mappers' iterators/ResultSet. We 
> could potentially miss some scans or look for more than we actually require, 
> since we check the original scans for this size. The resolved table would be 
> as per the new queryPlan, and there could be a mismatch here as well 
> (considering the index creation case you mentioned). There are potentially 
> other repercussions in case of intermediate metadata changes as well.
> Serializing a subset of the information (like the projector, which iterator 
> to use, etc.) of a QueryPlan and passing it from the driver to the mappers 
> without having them regenerate the plans seems like the best way forward.
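
One way to picture the hand-off proposed in the quoted description is the rough sketch below. It is an illustration under stated assumptions, not the Phoenix implementation: PlanSummary, the two property keys, and the plain Map standing in for the job configuration are all hypothetical, and the choice of fields (projected column names and an iterator kind) is only a guess at what "a subset of the information" might contain.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PlanHandoffSketch {

    // Hypothetical property keys; these are not real Phoenix configuration names.
    static final String COLUMNS_KEY = "sketch.plan.projected.columns";
    static final String ITERATOR_KEY = "sketch.plan.iterator.kind";

    /** Hypothetical subset of plan information that a mapper is assumed to need. */
    static final class PlanSummary {
        final List<String> projectedColumns;
        final String iteratorKind;

        PlanSummary(List<String> projectedColumns, String iteratorKind) {
            this.projectedColumns = projectedColumns;
            this.iteratorKind = iteratorKind;
        }
    }

    /** Driver side: write the summary into the job's string properties. */
    static void store(Map<String, String> jobProperties, PlanSummary summary) {
        // Assumes at least one column and no commas inside column names.
        jobProperties.put(COLUMNS_KEY, String.join(",", summary.projectedColumns));
        jobProperties.put(ITERATOR_KEY, summary.iteratorKind);
    }

    /** Mapper side: rebuild the summary from the properties without replanning. */
    static PlanSummary load(Map<String, String> jobProperties) {
        String columns = jobProperties.get(COLUMNS_KEY);
        String iteratorKind = jobProperties.get(ITERATOR_KEY);
        if (columns == null || iteratorKind == null) {
            throw new IllegalStateException("plan summary missing from job properties");
        }
        return new PlanSummary(Arrays.asList(columns.split(",")), iteratorKind);
    }

    public static void main(String[] args) {
        Map<String, String> jobProperties = new HashMap<>();
        store(jobProperties, new PlanSummary(Arrays.asList("ID", "NAME"), "serial"));

        PlanSummary inMapper = load(jobProperties);
        System.out.println(inMapper.projectedColumns + " via " + inMapper.iteratorKind);
    }
}
```

Because the mapper in this sketch reads only what the driver wrote, a metadata change between job creation and mapper launch cannot alter what it sees, which is the property the description is asking for.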



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
