[
https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804394#comment-13804394
]
Ravikumar commented on BLUR-220:
--------------------------------
Thanks for the pointers on the block-cache. I will look into integrating it.
Maybe the on-heap allocator can use this logic, while the off-heap allocator
continues with the existing code.
I will create a new issue for the NRT updates.
Let's walk through the sequence:
1. Tiny sorted segments make it to disk from RAM.
2. Future merges take place among already-sorted segments.
3. So, inside every segment, each Row will be co-located with all of its
Records. But these Rows will still be scattered across segments.
4. The SortingMergePolicy implementation uses TimSort underneath, which means
it is almost O(n) for already-sorted data. Also, this is quite different from a
linear scan, as merges always try to bulk fetch-and-write data. For actual
comparisons, please look at the details at
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13605896&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13605896
As per that link, big segment merges are actually quite fast and on par with
normal merges, provided the index uses no stored fields. Otherwise, merges will
be 2-3X slower.
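To make the setup concrete, here is a minimal sketch of wiring a
SortingMergePolicy into an IndexWriter. It assumes the Lucene 4.x
SortingMergePolicy from lucene-misc (org.apache.lucene.index.sorter), and the
"rowid" sort field is just a hypothetical name for illustration, not Blur's
actual schema:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class SortedSegmentWriter {
  public static IndexWriter open(Directory dir) throws Exception {
    // Sort each merged segment by the Row id, so all Records of a Row
    // end up co-located inside every segment (step 3 above).
    // "rowid" is a hypothetical field name, not Blur's actual schema.
    Sort rowSort = new Sort(new SortField("rowid", SortField.Type.STRING));

    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));

    // Wrap the regular merge policy: merges still bulk fetch-and-write,
    // and TimSort is near O(n) when the input segments are already sorted.
    conf.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), rowSort));

    return new IndexWriter(dir, conf);
  }
}
{code}

Note that the merge policy only guarantees sorted output for merges; step 1
(tiny segments arriving on disk already sorted) still has to be handled at
flush time.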
Let me know if you are convinced by this.
> Support for humongous Rows
> --------------------------
>
> Key: BLUR-220
> URL: https://issues.apache.org/jira/browse/BLUR-220
> Project: Apache Blur
> Issue Type: Improvement
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
> Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java,
> CreateIndex.java, CreateSortedIndex.java, FullRowReindexing.java,
> MyEarlyTerminatingCollector.java, SlabAllocator.java, SlabRAMDirectory.java,
> SlabRAMFile.java, SlabRAMInputStream.java, SlabRAMOutputStream.java,
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the
> number of Records. Currently, updates in Lucene are performed by deleting
> the document and re-adding it to the index. Unfortunately, when any update
> is performed on a Row in Blur, the entire Row has to be re-read (if the
> RowMutationType is UPDATE_ROW), then whatever modifications are needed are
> made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a
> given Row. It may vary based on the kind of hardware being used, but as a
> Row grows in size, indexing (mutations) against that Row will slow down.
> This issue is being created to discuss techniques for dealing with this
> problem.
--
This message was sent by Atlassian JIRA
(v6.1#6144)