[
https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804223#comment-13804223
]
Aaron McCurry commented on BLUR-220:
------------------------------------
We should probably look at integrating the slab feature into the block cache
subsystem in Blur; there is already a lot of logic there for off-heap allocation
and management. It's integrated into the CacheDirectory (v2) if you want to take
a look.
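Roughly, the off-heap slab idea looks something like the sketch below. This is
only an illustration of the concept, not the actual block cache code; the class
and method names are made up.

    import java.nio.ByteBuffer;

    /** Illustrative only: one large direct buffer ("slab") allocated off heap,
     *  handed out in fixed-size blocks so cached data stays off the GC heap. */
    public class SlabSketch {
      private final ByteBuffer slab;
      private final int blockSize;
      private int nextBlock;

      public SlabSketch(int slabSize, int blockSize) {
        this.slab = ByteBuffer.allocateDirect(slabSize); // off-heap allocation
        this.blockSize = blockSize;
      }

      /** Returns the next free block as a slice of the slab, or null if the
       *  slab is exhausted (a real allocator would grab another slab). */
      public synchronized ByteBuffer allocateBlock() {
        int offset = nextBlock * blockSize;
        if (offset + blockSize > slab.capacity()) {
          return null;
        }
        nextBlock++;
        ByteBuffer block = slab.duplicate();
        block.position(offset);
        block.limit(offset + blockSize);
        return block.slice();
      }
    }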
I have prototyped some logic that actually uses the RAMDirectory with a swap-out
mechanism like you described above. I got good results with the NRT updating:
opening the index took ~1 ms on average with an update rate of one update per ms.
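For reference, the shape of that prototype was roughly the following. This is a
minimal sketch against the Lucene 4.x-era API, not the prototype itself, and the
field names are made up; the swap-out step to the on-disk index is not shown.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NrtRamSketch {
      public static void main(String[] args) throws Exception {
        // hot rows live in a RAMDirectory while they are being mutated
        RAMDirectory ram = new RAMDirectory();
        IndexWriter writer = new IndexWriter(ram,
            new IndexWriterConfig(Version.LUCENE_43,
                new StandardAnalyzer(Version.LUCENE_43)));

        Document doc = new Document();
        doc.add(new StringField("rowid", "row-1", Field.Store.YES));
        writer.updateDocument(new Term("rowid", "row-1"), doc);

        // NRT reopen per update; this is the ~1 ms open mentioned above
        DirectoryReader reader = DirectoryReader.open(writer, true);
        System.out.println("numDocs=" + reader.numDocs());
        reader.close();
        writer.close();
      }
    }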
However, I think that what we are talking about here relates more to NRT updates
than to huge rows. I do have a concern about your proposed FilteredReader; it's
a performance concern.
Let's say that we go to update a row by adding a single record to it, and we
have to merge in the existing records from a row that lives in a large segment,
say 5 million documents with 50 million terms. The FilteredReader will have to
walk the entire field -> term -> doc -> position tree to locate the pieces of
the index that are related to the row in question. It's like a full table scan,
right?
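To make the concern concrete, here is a rough sketch of the walk I'm worried
about, written against the Lucene 4.x-era API. It is illustrative only: it
ignores positions, and the rowDocs bit set is a stand-in for "the docs that
belong to the row being updated".

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.FixedBitSet;

    public class RowScanCost {
      /** Counts the postings that must be visited to isolate one row's docs. */
      public static long postingsVisited(AtomicReader reader, FixedBitSet rowDocs)
          throws IOException {
        long visited = 0; // postings touched
        long kept = 0;    // postings that actually belong to the row
        Fields fields = reader.fields();
        for (String field : fields) {                    // every field
          Terms terms = fields.terms(field);
          if (terms == null) {
            continue;
          }
          TermsEnum termsEnum = terms.iterator(null);
          while (termsEnum.next() != null) {             // every term
            DocsEnum docs = termsEnum.docs(reader.getLiveDocs(), null);
            int doc;
            while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
              visited++;                                 // every posting
              if (rowDocs.get(doc)) {
                kept++; // only these postings matter for the merge
              }
            }
          }
        }
        // kept is tiny compared to visited: that is the "full table scan" effect
        return visited;
      }
    }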
I would like to continue the thread on the changes to NRT updates (the
RAMDirectory swap-out idea), but we should create a new issue to continue the
discussion.
Thanks!
Aaron
> Support for humongous Rows
> --------------------------
>
> Key: BLUR-220
> URL: https://issues.apache.org/jira/browse/BLUR-220
> Project: Apache Blur
> Issue Type: Improvement
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
> Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java,
> CreateIndex.java, CreateSortedIndex.java, FullRowReindexing.java,
> MyEarlyTerminatingCollector.java, SlabAllocator.java, SlabRAMDirectory.java,
> SlabRAMFile.java, SlabRAMInputStream.java, SlabRAMOutputStream.java,
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically
> the number of Records. Updates in Lucene are currently performed by deleting
> the document and re-adding it to the index. Unfortunately, when any update is
> performed on a Row in Blur (if the RowMutationType is UPDATE_ROW), the entire
> Row has to be re-read, whatever modifications are needed are made, and then
> the Row is reindexed in its entirety.
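> In code, that update path looks roughly like the following (illustrative only,
> not the actual Blur code; the field name and helper are made up):
>
>     import java.util.List;
>     import org.apache.lucene.document.Document;
>     import org.apache.lucene.index.IndexWriter;
>     import org.apache.lucene.index.Term;
>
>     public class RowUpdateSketch {
>       /** Adding one Record still means re-reading and re-adding every Record
>        *  of the Row, so the cost grows with Row size, not with the change. */
>       static void addRecord(IndexWriter writer, String rowId,
>                             List<Document> existingRecords,
>                             Document newRecord) throws Exception {
>         existingRecords.add(newRecord);
>         // delete every old document of the Row and re-add the whole Row
>         writer.updateDocuments(new Term("rowid", rowId), existingRecords);
>       }
>     }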
> Due to all of this overhead, there is a realistic limit on the size of a
> given Row. It may vary based on the kind of hardware being used, but as a Row
> grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this
> problem.
--
This message was sent by Atlassian JIRA
(v6.1#6144)