[
https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800557#comment-13800557
]
Ravikumar commented on BLUR-220:
--------------------------------
Here is the 10,000-ft view of the approach.
1. Let's have 2 RAMDirectories per-shard, in each shard-server.
One for buffering incoming documents [Active-RAM] and another for
merge-sorting and flushing to HDFS [FlushableRAM].
2. Based on the number of documents added or the absolute bytes consumed,
Active-RAM gets swapped with FlushableRAM per-shard. (A sketch of this
per-shard structure follows the list.)
3. For each incoming mutation, add the mutation to Active-RAM and delete that
mutation from the FlushableRAM and HDFS indexes (see the mutation sketch after
the list):
   a. getActiveRAMIndexWriter().updateDocuments(List<Document>); [contains
all record-specific mutations per-row]
   b. getFlushableRAMIndexWriter().delete(Query....
rowIdAndRecordIdQueries); [a set of queries containing rowId & recordId terms]
   c. getHDFSIndexWriter().delete(Query.... rowIdAndRecordIdQueries);
   d. Record in the Blur transaction log.
4. Step 3 continues until the threshold in step 2 is crossed. A swap of
Active-RAM and FlushableRAM happens, and a background thread starts
merge-sorting and flushing from FlushableRAM to HDFS, to co-locate all records
of a row.
5. It is highly likely that while the flush is running, deletes will arrive for
the FlushableRAM index by way of step 3. These are accumulated in a DeleteQueue
and committed alongside step 4 (see the flush sketch after the list).
6. Incoming searches will involve 3 IndexSearchers, one each on Active-RAM,
FlushableRAM and the HDFS index. A given row-record will be found in only one
index, no matter how many updates it has gone through (see the search sketch
after the list).
Please let me know your comments on this approach. I have a few questions as
well, but I will postpone them for now.
> Support for humongous Rows
> --------------------------
>
> Key: BLUR-220
> URL: https://issues.apache.org/jira/browse/BLUR-220
> Project: Apache Blur
> Issue Type: Improvement
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
> Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java,
> CreateIndex.java, CreateSortedIndex.java, FullRowReindexing.java,
> MyEarlyTerminatingCollector.java, test_results.txt, TestSearch.java,
> TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the
> number of Records. Currently, updates are performed in Lucene by deleting
> the document and re-adding it to the index. Unfortunately, when any update
> is performed on a Row in Blur, the entire Row has to be re-read (if the
> RowMutationType is UPDATE_ROW), whatever modifications are needed are made,
> and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a
> given Row. It may vary based on the kind of hardware being used, but as the
> Row grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques on how to deal with this
> problem.
--
This message was sent by Atlassian JIRA
(v6.1#6144)