[
https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ravikumar updated BLUR-220:
---------------------------
Attachment: Blur_Query_Perf_Chart1.pdf
MyEarlyTerminatingCollector.java
CreateIndex.java
CreateSortedIndex.java
TestSearch.java
I have modified the test case a little bit.
A table of results is attached along with the test cases.
I have assumed that the IDs in this test case correspond to Blur RowIds.
"Unsorted" --> Scatter records across segments.
"Optimize" --> Optimize every index into one single segment; all data is
present in that one segment.
"Sort" --> Use SortMergePolicy to locate IDs together in some of the segments.
"SortEarlyTerm" --> Same as "Sort", but during search, early-terminate on the
already sorted segments.
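To make the "SortEarlyTerm" idea concrete, here is a minimal plain-Java
sketch (no Lucene dependency; `CollectionTerminated`, `searchTopN`, and the
int-array "segments" are hypothetical stand-ins for Lucene's per-segment
collector machinery and its CollectionTerminatedException): when a segment is
known to be sorted, the first N docs collected are the best N, so collection
of the rest of that segment is aborted with an exception and the searcher
moves on to the next segment.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyTermSketch {
    // Stand-in for Lucene's per-segment termination signal.
    static class CollectionTerminated extends RuntimeException {}

    // A "segment" is just an array of doc ids; in a sorted segment the ids
    // appear in rank order, so the first n collected are the top n.
    static List<Integer> searchTopN(int[][] sortedSegments, int n) {
        List<Integer> hits = new ArrayList<>();
        for (int[] segment : sortedSegments) {
            try {
                int collected = 0;
                for (int doc : segment) {
                    hits.add(doc);
                    // Already have the best n of this segment: abort the
                    // rest of the segment via an exception.
                    if (++collected >= n) {
                        throw new CollectionTerminated();
                    }
                }
            } catch (CollectionTerminated e) {
                // Expected control flow: continue with the next segment.
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] segments = { {1, 2, 3, 4, 5}, {10, 11, 12, 13} };
        // Collects only 2 docs per segment instead of all 9.
        System.out.println(searchTopN(segments, 2)); // [1, 2, 10, 11]
    }
}
```

Note the exception-as-control-flow here, which is exactly the ugliness
called out below.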
What do the results show?
1. "SortEarlyTerm" is quite powerful when the number of rowIds is small
(<= 10K).
2. As the number of rowIds increases, the optimized single segment outperforms
everything else, which is understandable.
3. There is a slight difference in results between early-terminating and fully
completing queries on sorted segments. I guess this is open to interpretation.
There are 2 issues with early termination to keep in mind.
1. It can only be done by throwing an exception per segment. This is way too
ugly and may also be a tad costly.
2. Not all docs of a row are examined, hence the per-row scoring is wrong.
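The second issue can be illustrated with a small sketch (plain Java;
`rowScores`, the sum-based aggregation, and the sample scores are
illustrative assumptions, not Blur's actual scoring): if a Row's score is
aggregated from the scores of its Records, stopping collection partway
through a Row produces a truncated, and therefore wrong, row score.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowScoreSketch {
    // Aggregate per-doc (per-Record) scores into a per-row score.
    // Summing is just one possible aggregation, used here for illustration.
    static Map<String, Float> rowScores(String[] rowIds, float[] docScores,
                                        int docsCollected) {
        Map<String, Float> scores = new LinkedHashMap<>();
        for (int i = 0; i < docsCollected; i++) {
            scores.merge(rowIds[i], docScores[i], Float::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Three Records of row1 followed by one Record of row2.
        String[] rowIds = {"row1", "row1", "row1", "row2"};
        float[] scores  = {1.0f, 2.0f, 3.0f, 1.5f};

        // Full collection: row1 aggregates all three of its Records (6.0).
        System.out.println(rowScores(rowIds, scores, 4));

        // Early termination after 2 docs: row1 only aggregates 3.0, so its
        // score relative to other rows is no longer trustworthy.
        System.out.println(rowScores(rowIds, scores, 2));
    }
}
```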
> Support for humongous Rows
> --------------------------
>
> Key: BLUR-220
> URL: https://issues.apache.org/jira/browse/BLUR-220
> Project: Apache Blur
> Issue Type: Improvement
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
> Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java,
> CreateIndex.java, CreateSortedIndex.java, MyEarlyTerminatingCollector.java,
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the
> number of Records. Updates in Lucene are currently performed by deleting the
> document and re-adding it to the index. Unfortunately, when any update is
> performed on a Row in Blur, the entire Row has to be re-read (if the
> RowMutationType is UPDATE_ROW), whatever modifications are needed are made,
> and then it is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a
> given Row. It may vary based on the kind of hardware being used; as the
> Row grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques on how to deal with this
> problem.
--
This message was sent by Atlassian JIRA
(v6.1#6144)