[ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793423#comment-13793423 ]
Aaron McCurry commented on BLUR-220:
------------------------------------
I have attached some prototypes for doing query time joins. Basic results are
as follows:

  Documents     Small query   Large query
  30,000        0.455 ms      8.119 ms
  300,000       0.547 ms      92.168 ms
  3,000,000     1.428 ms      3167.428 ms
  30,000,000    1.698 ms      64137.19 ms
As I expected, the large query, which essentially hits every document, slows
dramatically as the number of documents in the index grows. So if we are to
move forward with this approach, we will need to search different segments in
different ways. If we search segments created from NRT updates with this
approach, and search merged segments with the existing approach, we should
get performance pretty close to what it is today, with the benefit of not
having to reindex the entire row for every record mutation.
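To make the mixed approach concrete, here is a minimal sketch of how the
per-segment dispatch could decide which strategy to use, assuming Lucene 4.x
APIs. It keys off the "source" diagnostic that IndexWriter records on each
segment (flush vs. merge); the two query strategies themselves are left
abstract.
{code:java}
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.SegmentReader;

public class SegmentDispatch {

  // Sketch only: decide per segment whether records are colocated.
  // Merged segments (diagnostics "source" == "merge") would use the
  // existing row-level approach; freshly flushed NRT segments would
  // use the query time join.
  public static boolean useExistingApproach(AtomicReaderContext context) {
    if (!(context.reader() instanceof SegmentReader)) {
      return false; // wrapped reader, fall back to the join approach
    }
    SegmentReader segmentReader = (SegmentReader) context.reader();
    String source = segmentReader.getSegmentInfo().info
        .getDiagnostics().get("source");
    return "merge".equals(source);
  }
}
{code}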
This approach has two main problems to be solved.
The first is the ability to perform merges that colocate the records for a
given row within the merged segment. This will likely require a custom
SortingMergePolicy.
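A minimal sketch of wiring that up, assuming the SortingMergePolicy from
lucene-misc (Lucene 4.6 style constructor); the "rowid" field name is just a
placeholder for whatever field carries the row id:
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.Version;

public class RowSortingConfig {

  // Sketch only: sort merged segments by row id so that all records
  // of a row end up adjacent in the merged segment.
  public static IndexWriterConfig newWriterConfig() {
    Sort byRowId = new Sort(new SortField("rowid", SortField.Type.STRING));
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_46,
        new StandardAnalyzer(Version.LUCENE_46));
    conf.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), byRowId));
    return conf;
  }
}
{code}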
The second is the ability to split the logical query into two different
queries based on the segment type and still get the correct answer from the
mixed approach. This will require some custom query logic based on the
existing SuperQuery object and the lucene-join project.
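For the NRT segments, the join side could look something like the following,
assuming JoinUtil from the lucene-join project; the field and term names are
placeholders:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

public class RecordToRowJoin {

  // Sketch only: run a record-level query and join the matching
  // records up to their rows via the shared "rowid" field.
  public static Query rowQuery(IndexSearcher searcher) throws IOException {
    Query recordQuery = new TermQuery(new Term("fam.col", "value"));
    return JoinUtil.createJoinQuery("rowid", false, "rowid",
        recordQuery, searcher, ScoreMode.None);
  }
}
{code}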
This will be fairly complex, but if solved, it will resolve one of the
biggest performance issues in Blur to date.
Aaron
> Support for humongous Rows
> --------------------------
>
> Key: BLUR-220
> URL: https://issues.apache.org/jira/browse/BLUR-220
> Project: Apache Blur
> Issue Type: Improvement
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
> Attachments: CreateIndex.java, test_results.txt, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the
> number of Records. Updates in Lucene are currently performed by deleting the
> document and re-adding it to the index. Unfortunately, when any update is
> performed on a Row in Blur, the entire Row has to be re-read (if the
> RowMutationType is UPDATE_ROW), the modifications applied, and then the Row
> reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a
> given Row. The limit may vary based on the kind of hardware being used, but
> as the Row grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques on how to deal with this
> problem.