[ 
https://issues.apache.org/jira/browse/OAK-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikas Saurabh updated OAK-6269:
-------------------------------
    Attachment: s3Files.false.txt
                s3Files.true.txt
                S3DsSupport.patch

Ran a 60-minute {{HybridIndexTest}} benchmark with the searcher disabled against S3 
\[0]. Here is a bit of analysis:
|| Type || NumBlobs || DSSize (bytes) || Blobs of size 1047568 bytes || BenchmarkN || BenchmarkMutator || BenchmarkIndexed ||
| Chunked | 3429 | 878830059 | 477 | 360748 | 1481808 | 2165436 |
| Streaming | 2960 | 891535294 | 0 | 370771 | 1492827 | 2225670 |

[^s3Files.true.txt] and [^s3Files.false.txt] are the lists of S3 blobs after the 
streaming and chunked runs respectively.

Given the numbers, I think the feature is behaving fine (at least it doesn't show 
any obvious performance drop). Moreover, there's a fair drop in the number of 
blobs even in a test run whose index files are not particularly big - so we are 
on the right track too. I'd hence commit this soon.
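As a sanity check, the blob-count reduction from the table above works out to roughly 13.7%; a quick sketch (the class name is just for illustration):

```java
public class BlobDrop {
    public static void main(String[] args) {
        // Numbers taken from the benchmark table above.
        int chunkedBlobs = 3429;   // blobs after the chunked run
        int streamingBlobs = 2960; // blobs after the streaming run
        double dropPct = 100.0 * (chunkedBlobs - streamingBlobs) / chunkedBlobs;
        System.out.printf("blob count drop: %.1f%%%n", dropPct); // ~13.7%
        // Note also that the chunked run has 477 blobs of exactly 1047568
        // bytes (the full-chunk blob size); the streaming run has none.
    }
}
```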

PS: Btw, there's currently no way to run oak-benchmarks against S3DS, so I used a 
quick patch ([^S3DsSupport.patch]) - maybe someday this should make it into trunk.
\[0]
{noformat}
$ for stream in true false
> do
>     for ds in S3DS
>     do
>         echo $stream - $ds
>         aws s3 rm s3://saurabh2/ --recursive --quiet
>         java -Doak.lucene.enableSingleBlobIndexFiles=$stream -Druntime=3600 
> -Ds3.config=/home/vsaurabh/Desktop/aws.properties -DblobStoreType=$ds 
> -DsearcherEnabled=false -DindexingMode=async -jar 
> oak-benchmarks/target/oak-benchmarks-1.8-SNAPSHOT.jar benchmark --base 
> ./oak-benchmarks/target/test HybridIndexTest Oak-Segment-Tar-DS
>         aws s3 ls saurabh2 | tee s3Files.$stream.$ds.txt | wc -l
>     done
> done
true - S3DS
Apache Jackrabbit Oak 1.8-SNAPSHOT
# HybridIndexTest                  C     min     10%     50%     90%     max    
   N Searcher  Mutator  Indexed
Oak-Segment-Tar-DS                 1       5       8       9      12    1419  
370771       0   1492827   2225670      #property,numIdxs:10
numOfIndexes: 10, refreshDeltaMillis: 1000, asyncInterval: 5, queueSize: 1000 , 
hybridIndexEnabled: false, indexingMode: async, useOakCodec: true 
2960
false - S3DS
Apache Jackrabbit Oak 1.8-SNAPSHOT
# HybridIndexTest                  C     min     10%     50%     90%     max    
   N Searcher  Mutator  Indexed
Oak-Segment-Tar-DS                 1       5       8       9      12    1444  
360748       0   1481808   2165436      #property,numIdxs:10
numOfIndexes: 10, refreshDeltaMillis: 1000, asyncInterval: 5, queueSize: 1000 , 
hybridIndexEnabled: false, indexingMode: async, useOakCodec: true 
3429
{noformat}

> Support non chunk storage in OakDirectory
> -----------------------------------------
>
>                 Key: OAK-6269
>                 URL: https://issues.apache.org/jira/browse/OAK-6269
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Vikas Saurabh
>             Fix For: 1.8
>
>         Attachments: 
> 0001-OAK-6269-Support-non-chunk-storage-in-OakDirectory.patch, 
> 0002-OAK-6269-Support-non-chunk-storage-in-OakDirectory.patch, 
> 0003-OAK-6269-Support-non-chunk-storage-in-OakDirectory.patch, 
> S3DsSupport.patch, s3Files.false.txt, s3Files.true.txt
>
>
> Logging this issue based on offline discussion with [~catholicon].
> Currently OakDirectory stores files in chunks of 1 MB each, so a 1 GB file 
> would be stored in 1000+ chunks of 1 MB.
> This design was done to support direct usage of OakDirectory with Lucene, as 
> Lucene makes use of random IO. Chunked storage allows it to seek to a random 
> position quickly. If the files were stored as single blobs, they could only be 
> accessed via streaming, which would be slow.
> As most setups now use copy-on-read and copy-on-write support and rely on a 
> local copy of the index, we can have an implementation which stores the file 
> as a single blob.
> *Pros*
> * Quite a bit of reduction in the number of small blobs stored in the 
> BlobStore, which should reduce the GC time, especially for S3 
> * Reduced overhead of storing a single file in the repository. Instead of an 
> array of 1k blob ids, we would store a single blob id 
> * Potential improvement in IO cost, as a file can be read in one connection 
> and uploaded in one.
> *Cons*
> It would not be possible to use OakDirectory directly (or it would be very 
> slow), and we would always need to do a local copy.
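The random-access property of chunked storage described in the issue comes down to simple offset arithmetic; a minimal sketch (hypothetical names, not Oak's actual implementation):

```java
// Why chunked storage permits cheap random access: a logical file
// position maps directly to a chunk id plus an offset within that chunk,
// so a seek only needs to load the one chunk that contains the position.
public class ChunkAddressing {
    static final int CHUNK_SIZE = 1024 * 1024; // 1 MB chunks, as in OakDirectory

    static int chunkIndex(long position) {
        return (int) (position / CHUNK_SIZE);
    }

    static int offsetInChunk(long position) {
        return (int) (position % CHUNK_SIZE);
    }

    public static void main(String[] args) {
        long pos = 5L * 1024 * 1024 + 42; // a seek into a large index file
        System.out.println(chunkIndex(pos));    // 5
        System.out.println(offsetInChunk(pos)); // 42
        // With a single streamed blob, the same seek would require reading
        // (or skipping) everything before 'pos' over the stream - hence the
        // "always need a local copy" caveat above.
    }
}
```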



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
