[jira] [Commented] (PHOENIX-1779) Parallelize fetching of next batch of records for scans corresponding to queries with no order by

Samarth Jain (JIRA) Tue, 14 Apr 2015 12:46:21 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494736#comment-14494736
 ]


Samarth Jain commented on PHOENIX-1779:
---------------------------------------

bq. Can you add a row count check to this test? I noticed your other test has 
that already:
Will do.

bq. In RoundRobinResultIterator, does the iterators.size() change as a result 
of a split having occurred (as otherwise it sounds like a race condition)?
The size doesn't change because of splits. If splits happen before the start of 
query, then BaseResultIterators.getIterators() handles it for us. If splits 
happen after the query has started executing, HBase hides it from us.
The check below is needed because calling getIterators() may have closed 
some of the iterators.
{code}
+                    // resize and replace the iterators list.
+                    size = openIterators.size();
+                    if (size > 0) {
+                        iterators = getIterators();
+                        // Possible that the number of iterators changed after the above call.
+                        size = iterators.size(); 
{code}
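
To illustrate why the size has to be re-read after the list of open iterators changes, here is a minimal standalone sketch. It is not the code from the patch: plain java.util.Iterator stands in for PeekingResultIterator, and the class and method names (RoundRobinSketch, roundRobin) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RoundRobinSketch {
    /** Interleaves the sources one element at a time, dropping exhausted ones. */
    static <T> List<T> roundRobin(List<Iterator<T>> sources) {
        List<Iterator<T>> open = new ArrayList<>(sources);
        List<T> out = new ArrayList<>();
        int index = 0;
        int size = open.size();
        while (size > 0) {
            Iterator<T> current = open.get(index);
            if (current.hasNext()) {
                out.add(current.next());
                index = (index + 1) % size;
            } else {
                // Drop the exhausted iterator from the open list.
                open.remove(index);
                // The list shrank, so re-read its size before indexing again.
                size = open.size();
                if (size > 0) {
                    index = index % size;
                }
            }
        }
        return out;
    }
}
```

If the stale size were reused after the removal, the modulo arithmetic could produce an index past the end of the shortened list.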

bq. I don't think you need the Map<PeekingResultIterator, Integer> in 
RoundRobinResultIterators. You just need two parallel arrays: a 
PeekingResultIterator[] for the open iterators and an int[] with the number of 
records read for each open iterator. The index member variable will index into 
them. When an iterator is exhausted, you just remove that iterator from the 
PeekingResultIterator[] and remove the record read count from the int[]

Having two parallel arrays sounds more complicated than maintaining a map, IMHO. 
Gets and puts rely on the address (identity) of the PeekingResultIterator, so 
the map should perform about as well as arrays.
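
For comparison, here is a rough sketch of the two bookkeeping options being discussed. It is only an illustration under assumptions: java.util.Iterator stands in for PeekingResultIterator, the class names (ReadCountSketch, MapCounts, ArrayCounts) are hypothetical, and IdentityHashMap is my choice to make the identity-based lookup explicit, not necessarily what the patch uses.

```java
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class ReadCountSketch {
    /** Option A: one map entry per open iterator, keyed by object identity. */
    static final class MapCounts<T> {
        private final Map<Iterator<T>, Integer> counts = new IdentityHashMap<>();

        MapCounts(List<? extends Iterator<T>> iterators) {
            for (Iterator<T> it : iterators) {
                counts.put(it, 0);
            }
        }
        void recordRead(Iterator<T> it) { counts.merge(it, 1, Integer::sum); }
        int readsFor(Iterator<T> it) { return counts.getOrDefault(it, 0); }
        void close(Iterator<T> it) { counts.remove(it); }
        int openCount() { return counts.size(); }
    }

    /** Option B: two parallel arrays addressed by the same index. */
    static final class ArrayCounts {
        private final Iterator<?>[] open;
        private final int[] counts;
        private int size;

        ArrayCounts(List<? extends Iterator<?>> iterators) {
            open = iterators.toArray(new Iterator<?>[0]);
            counts = new int[open.length];
            size = open.length;
        }
        void recordRead(int index) { counts[index]++; }
        int readsAt(int index) { return counts[index]; }
        void close(int index) {
            // Both arrays must be shifted together to stay aligned.
            int tail = size - index - 1;
            System.arraycopy(open, index + 1, open, index, tail);
            System.arraycopy(counts, index + 1, counts, index, tail);
            size--;
            open[size] = null;
        }
        int openCount() { return size; }
    }
}
```

The map version removes an exhausted iterator with a single call, while the array version has to keep the two arrays aligned by hand on every removal.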

bq. I don't think you need the PrefetchedRecordsIterator.
Agreed. Maintaining a separate Tuple array or list will be sufficient. There is 
almost always a way around instanceof checks and I trusted you to come up with 
one :).

Will change the method name and add more tests in QueryCompilerTest.


> Parallelize fetching of next batch of records for scans corresponding to 
> queries with no order by 
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1779
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1779
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Assignee: Samarth Jain
>         Attachments: PHOENIX-1779.patch, wip.patch, wip3.patch, 
> wipwithsplits.patch
>
>
> Today in Phoenix we parallelize the first execution of scans i.e. we load 
> only the first batch of records up to the scan's cache size in parallel. 
> Loading of subsequent batches of records in scanners is essentially serial. 
> This could be improved especially for queries, including the ones with no 
> order by clauses,  that do not need any kind of merge sort on the client. 
> This could also potentially improve the performance of UPSERT SELECT 
> statements that load data from one table and insert into another. One such 
> use case being creating immutable indexes for tables that already have data. 
> It could also potentially improve the performance of our MapReduce solution 
> for bulk loading data by improving the speed of the loading/mapping phase. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
