If it helps, this is what the Oak configuration shows when it spins up:

2018-03-17 13:15:48.171  INFO 35898 --- [           main]
o.a.j.oak.segment.file.FileStore         : Creating file store
FileStoreBuilder{version=1.8-SNAPSHOT, directory=/dms/oak-repository,
blobStore=null, maxFileSize=256, segmentCacheSize=256, stringCacheSize=256,
templateCacheSize=64, stringDeduplicationCacheSize=15000,
templateDeduplicationCacheSize=3000, nodeDeduplicationCacheSize=1048576,
memoryMapping=true, gcOptions=SegmentGCOptions{paused=false,
estimationDisabled=false, gcSizeDeltaEstimation=1073741824, retryCount=5,
forceTimeout=60, retainedGenerations=2, gcType=FULL}}
2018-03-17 13:15:48.774  INFO 35898 --- [           main]
o.a.j.oak.segment.file.FileStore         : TarMK opened:
/dms/oak-repository (mmap=true)
2018-03-17 13:15:48.806  INFO 35898 --- [           main]
SegmentNodeStore$SegmentNodeStoreBuilder : Creating segment node store
SegmentNodeStoreBuilder{blobStore=inline}
2018-03-17 13:15:48.812  INFO 35898 --- [           main]
o.a.j.o.s.scheduler.LockBasedScheduler   : Initializing SegmentNodeStore
with the commitFairLock option enabled.
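
For context on that last log line: the thread dump quoted further down shows the workers parked on a java.util.concurrent.Semaphore$FairSync inside LockBasedScheduler.schedule. Below is a minimal sketch of that pattern in plain Java, assuming commits are serialized through a single-permit fair semaphore. That is an assumption based on the stack trace, not on Oak's source. FairCommitSketch, maxWaitMillis and the sleep standing in for a commit are made up for illustration and are not Oak code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only: workers queue on a fair, single-permit
// semaphore, so only one "commit" runs at a time.
final class FairCommitSketch {

    // Returns the longest time (in ms) any worker waited for the permit.
    static long maxWaitMillis(int workers, long commitMillis) throws InterruptedException {
        Semaphore commitPermit = new Semaphore(1, true); // true = fair (FIFO) hand-off
        AtomicLong maxWaitNanos = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                long start = System.nanoTime();
                try {
                    commitPermit.acquire(); // threads park here while another commit runs
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    maxWaitNanos.accumulateAndGet(System.nanoTime() - start, Math::max);
                    Thread.sleep(commitMillis); // stand-in for the actual commit work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    commitPermit.release();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return TimeUnit.NANOSECONDS.toMillis(maxWaitNanos.get());
    }
}
```

With 4 workers and a 50 ms stand-in commit, the last worker in the queue waits for roughly three commits, so any slowdown in a single commit is multiplied for the threads queued behind it.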

At this point, there are 215,275 documents in the repository.  In the TarMK
/ repository folder, the size on disk is about 118G.  Creating and saving a
file node takes about 3-5 seconds, and calling the VersionManager to check
it in takes about the same.  When there was no content in the repository,
both of those operations were almost instantaneous.
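
A minimal sketch of how per-step timings like these could be collected in plain Java. StepTimer is a hypothetical helper, not part of Oak or JCR; session, vm and newNode in the usage comment are assumed to be the existing JCR Session, the VersionManager and the newly created file node.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not an Oak or JCR class): accumulates how long one
// kind of step takes, e.g. one instance for save and one for check-in.
final class StepTimer {
    private long totalNanos;
    private long maxNanos;
    private int count;

    // Runs the step, records its duration even if it throws, and returns its result.
    <T> T time(Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    // Synchronized so several import threads can share one timer.
    private synchronized void record(long elapsedNanos) {
        totalNanos += elapsedNanos;
        maxNanos = Math.max(maxNanos, elapsedNanos);
        count++;
    }

    synchronized int count() {
        return count;
    }

    synchronized long averageMillis() {
        return count == 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(totalNanos / count);
    }

    synchronized long maxMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxNanos);
    }
}

// Assumed usage inside the import loop:
//   saveTimer.time(() -> { session.save(); return null; });
//   checkinTimer.time(() -> vm.checkin(newNode.getPath()));
```

Logging averageMillis() and maxMillis() every few thousand documents would show at what repository size the save and check-in times start to climb.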

I've been looking at the Adobe AEM documentation / forums, and there seems
to be some mixed guidance on using offline compaction (I'm not using AEM
but assuming the general principles would apply).  I have tried the same
thing with GC disabled:

SegmentGCOptions gcOptions =
    SegmentGCOptions.defaultGCOptions().setOffline();
...
fsBuilder.withGCOptions(gcOptions);

...with the same basic performance as I see with it enabled.  (I'm getting
one file imported every ~3-10 seconds.)

I just came across a presentation:
https://adapt.to/content/dam/adaptto/production/presentations/2016/adaptTo2016-Into-the-tar-pit-a-TarMK-deep-dive-Michael-Duerig-notes.pdf/_jcr_content/renditions/original./adaptTo2016-Into-the-tar-pit-a-TarMK-deep-dive-Michael-Duerig-notes.pdf

It mentions TarMK has "limited scalability"...  my initial tests of a
handful of gigabytes seemed OK, but am I hitting an inherent limitation in
the size of data TarMK can handle?


On Sat, Mar 17, 2018 at 1:32 PM, William Markmann <
b...@counterpointconsulting.com> wrote:

> After adding some JMX instrumentation, I can clearly see that the time is
> being spent in the session.save() called when initially creating the
> document node, and then calling vm.checkin(newNode.getPath()) after
> that.  These slowed way down as more and more nodes were added to the
> repository.
>
> 1) Is it necessary to do the above in two steps, or can a Node be created
> and checked in with the VersionManager in one shot?
> 2) What is actually happening in terms of indexing when I do the above?
> Is there any way to / would it be useful to temporarily disable any
> on-the-fly indexing during a bulk import and run it later?
> 3) Does GC / compaction in the NodeStore come into play when adding
> nodes?  Same question -- is there a way to / would it be useful to disable
> anything related to those during a bulk import and perform them offline
> after?
> 4) Is there anything inherent in the SegmentNodeStore that would decrease
> in performance as the repository grows?
>
> Thanks! - Bill
>
>
> On Fri, Mar 16, 2018 at 12:30 PM, William Markmann <
> b...@counterpointconsulting.com> wrote:
>
>> Is there any reason I'd see:
>>
>> 2018-03-15 21:48:54.673  INFO 20475 --- [ex-update-async]
>> o.a.j.oak.plugins.index.IndexUpdate      : Incremental indexing
>> Traversed #10000 /NJ Foreclosure/342CKA-IANWK/SCRA
>> SEARCHES-AAMZG/FRCL201604PTI00003XEGT-20160425-4351400/jcr:content
>> [Infinity nodes/s, Infinity nodes/hr]
>>
>> ...regularly at the outset, but it stops appearing after a certain point?
>>
>> If I take a thread-dump after it starts slowing down, I see the worker
>> threads (usually all but one once slow-down starts) parked at:
>>
>> "pool-5-thread-3" #25772 prio=5 os_prio=0 tid=0x000000000201e000
>> nid=0xc0c0 waiting on condition [0x00007f42486cc000]
>>    java.lang.Thread.State: WAITING (parking)
>> at sun.misc.Unsafe.park(Native Method)
>> - parking to wait for  <0x00000004c1327098> (a
>> java.util.concurrent.Semaphore$FairSync)
>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>> at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
>> at org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler.schedule(LockBasedScheduler.java:217)
>> at org.apache.jackrabbit.oak.segment.SegmentNodeStore.merge(SegmentNodeStore.java:195)
>> at org.apache.jackrabbit.oak.core.MutableRoot.commit(MutableRoot.java:248)
>> at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.commit(SessionDelegate.java:347)
>> at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.commit(SessionDelegate.java:360)
>> at org.apache.jackrabbit.oak.jcr.version.ReadWriteVersionManager.checkin(ReadWriteVersionManager.java:129)
>> at org.apache.jackrabbit.oak.jcr.delegate.VersionManagerDelegate.checkin(VersionManagerDelegate.java:67)
>> at org.apache.jackrabbit.oak.jcr.version.VersionManagerImpl$7.perform(VersionManagerImpl.java:371)
>> at org.apache.jackrabbit.oak.jcr.version.VersionManagerImpl$7.perform(VersionManagerImpl.java:362)
>> at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:208)
>> at org.apache.jackrabbit.oak.jcr.version.VersionManagerImpl.checkin(VersionManagerImpl.java:362)
>>
>>
>> When I'm inserting documents at the very beginning (no content yet), the
>> individual threads don't park at that state for nearly as long...
>>
>>
>>
>> On Fri, Mar 16, 2018 at 11:55 AM, Julian Reschke <julian.resc...@gmx.de>
>> wrote:
>>
>>> On 2018-03-16 16:46, William Markmann wrote:
>>>
>>>> Folders are: org.apache.jackrabbit.JcrConstants.NT_FOLDER
>>>> Documents are:
>>>>
>>>>     Binary fileBinary = session.getValueFactory().createBinary(
>>>>         new ByteArrayInputStream(data));
>>>>     Node newFile = parentNode.addNode(filename, JcrConstants.NT_FILE);
>>>>     newFile.addMixin(JcrConstants.MIX_VERSIONABLE);
>>>>     Node docContents = newFile.addNode(JcrConstants.JCR_CONTENT,
>>>>         JcrConstants.NT_RESOURCE);
>>>>     // docContents.setProperty(JcrConstants.JCR_MIMETYPE,
>>>>     //     getMimeType(filename, getFileExtension(filename)));
>>>>     docContents.setProperty(JcrConstants.JCR_MIMETYPE,
>>>>         FileUtils.getMimeType(FileUtils.getFileExtension(filename)));
>>>>     docContents.setProperty(JcrConstants.JCR_ENCODING, "");
>>>>     docContents.setProperty(JcrConstants.JCR_DATA, fileBinary);
>>>>
>>>> Is there a better choice?
>>>> ...
>>>>
>>>
>>> I was worried the folder might have "orderable" child nodes, which
>>> creates a significant overhead. But AFAIR that is not the case for
>>> nt:folder (but you may want to check).
>>>
>>> Best regards, Julian
>>>
>>> PS: I wouldn't set JCR_ENCODING if that information isn't present.
>>>
>>
>>
>>
>> --
>> *Bill Markmann*
>> *President | 866 809 0394 x 701*
>> *Counterpoint Consulting*
>> *Automate. Innovate. Accelerate.*
>> c20g.com | *Blog <http://www.c20g.com/site/blog> **| Linkedin
>> <http://www.linkedin.com/company/counterpoint-consulting-inc.>** |
>> Twitter <https://twitter.com/c20g>*
>>
>
>
>
>



