[ 
https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090886#comment-13090886
 ] 

Simon Willnauer commented on SOLR-2700:
---------------------------------------

{quote}
Just to get a rough idea of performance, I uploaded one of my CSV test files 
(765MB, 100M docs, 7 small string fields per doc).
Time to complete indexing was 42% longer, and the transaction log grew to 
1.8GB. The lucene index was 1.2GB. The log was on the same device, so the main 
impact may have been disk IO.
{quote}

I think this is far from what we can really do here. I didn't look too closely 
at the code yet, but it seems you are doing blocking writes, which might not be 
ideal here at all. I think what you can do here is allocate the space you need 
per record and write concurrently on a Channel (see 
FileChannel#write(ByteBuffer src, long position)); the same is true for reads 
(FileChannel#read(ByteBuffer dst, long position)). All we need to store in 
main memory is the offset and the length to do the realtime get.
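A minimal sketch of that idea (class and field names here are invented, not from the patch): reserve space per record with an atomic counter, write at an absolute position so threads never block each other, and keep only (offset, length) in RAM for the realtime get:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: lock-free positional appends to a transaction log.
public class PositionalTlog {
    private final FileChannel ch;
    private final AtomicLong tail = new AtomicLong();                    // next free byte in the file
    private final Map<String, long[]> index = new ConcurrentHashMap<>(); // id -> {offset, length}

    public PositionalTlog(Path path) throws IOException {
        ch = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Reserve space atomically, then write without holding any lock. */
    public void append(String id, byte[] record) throws IOException {
        long start = tail.getAndAdd(record.length);   // per-record space allocation
        long pos = start;
        ByteBuffer src = ByteBuffer.wrap(record);
        while (src.hasRemaining()) {
            pos += ch.write(src, pos);                // positional write; no shared file pointer
        }
        index.put(id, new long[] { start, record.length });
    }

    /** Realtime get straight from the log: only offset+length live in RAM. */
    public byte[] lookup(String id) throws IOException {
        long[] loc = index.get(id);
        if (loc == null) return null;
        ByteBuffer dst = ByteBuffer.allocate((int) loc[1]);
        long pos = loc[0];
        while (dst.hasRemaining()) {
            int n = ch.read(dst, pos);                // positional read, no seek on shared state
            if (n < 0) break;
            pos += n;
        }
        return dst.array();
    }
}
```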

To take that one step further, it might be good to keep the already-serialized 
data around if possible; if a binary update is used, can we piggyback the 
bytes in the SolrInputDocument somehow? If not, I think we should use a faster 
hand-written serialization instead of Java serialization, which is proven to 
be freaking slow.
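As a sketch of what a hand-rolled codec could look like (the framing here is invented for illustration, not Solr's actual binary update format): a flat document of string fields written with DataOutputStream avoids all the per-object overhead of java.io.Serializable:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hand-written codec for a flat document of string fields.
public class DocCodec {
    public static byte[] write(Map<String, String> doc) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(doc.size());                 // field count up front
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.writeUTF(e.getKey());             // length-prefixed field name
            out.writeUTF(e.getValue());           // length-prefixed field value
        }
        out.flush();
        return bos.toByteArray();
    }

    public static Map<String, String> read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int n = in.readInt();
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            doc.put(in.readUTF(), in.readUTF());
        }
        return doc;
    }
}
```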

Another, totally different idea for the RT get is to spend more time on a RAM 
Reader that is capable of doing exact seeks on the BytesRefHash we use anyway. 
I don't think this would be too far away, since the biggest problem here is to 
provide an efficiently sorted dictionary. Maybe this should be a long-term 
goal for the RT Get feature. 

Since we are already doing write-behind here, we could also try to use some 
compression, especially if the source data is large. Not sure if that will pay 
off though, since we are not keeping the logs around forever. 
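A sketch of that trade-off (the threshold and one-byte flag framing are made up for illustration): compress only records above a size cutoff with a fast Deflater level, and fall back to storing raw bytes when deflate does not win:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical write-behind compression: small records skip the Deflater
// entirely, since its overhead would not pay off for them.
public class LogCompression {
    static final int THRESHOLD = 512; // hypothetical cutoff in bytes

    /** Returns a record prefixed with a flag byte: 0 = raw, 1 = deflated. */
    public static byte[] pack(byte[] record) throws IOException {
        if (record.length >= THRESHOLD) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            bos.write(1); // flag: deflated
            try (DeflaterOutputStream out = new DeflaterOutputStream(
                    bos, new Deflater(Deflater.BEST_SPEED))) {
                out.write(record);
            }
            byte[] packed = bos.toByteArray();
            if (packed.length < record.length) return packed; // only keep if smaller
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(0); // flag: raw
        bos.write(record, 0, record.length);
        return bos.toByteArray();
    }

    public static byte[] unpack(byte[] stored) throws IOException {
        InputStream in = new ByteArrayInputStream(stored, 1, stored.length - 1);
        if (stored[0] == 1) in = new InflaterInputStream(in);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) bos.write(buf, 0, n);
        return bos.toByteArray();
    }
}
```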

Eventually I think this should be a feature that lives outside of Solr, since 
many Lucene applications could make use of it. ElasticSearch, for instance, 
uses pretty similar features, which could be adapted into something like a 
DurableIndexWriter wrapper.

> transaction logging
> -------------------
>
>                 Key: SOLR-2700
>                 URL: https://issues.apache.org/jira/browse/SOLR-2700
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Yonik Seeley
>         Attachments: SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, 
> SOLR-2700.patch, SOLR-2700.patch
>
>
> A transaction log is needed for durability of updates, for a more performant 
> realtime-get, and for replaying updates to recovering peers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
