Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
On Tue, Mar 10, 2015 at 1:50 PM, Michael Marth mma...@adobe.com wrote:
 But I wonder: how do you envision that this new index cleanup would locate
 indexes in the content-addressed DS?

That's a bit tricky. I have a rough idea of how to approach it, but it
would require more thought. The approach I am thinking of is:

1. Have an index on oak:QueryIndexDefinition
2. Query for all index definition nodes with type=lucene
3. Get the ':data' node and then perform the listing. Each child node
is a Lucene index file representation
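
A minimal sketch of step 3, assuming the index definition's NodeState has
already been located via the query in step 2. The hidden ':data' node is
not visible through a JCR session, so the NodeState API is used here; this
is an illustration, not an existing Oak tool:

    import org.apache.jackrabbit.oak.spi.state.ChildNodeEntry;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    // Hypothetical sketch: list the Lucene index file nodes under the
    // hidden :data node of one lucene index definition.
    void listIndexFiles(NodeState indexDef) {
        NodeState data = indexDef.getChildNode(":data");
        for (ChildNodeEntry file : data.getChildNodeEntries()) {
            // each child node represents one Lucene index file
            System.out.println(file.getName());
        }
    }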

For Mongo I can easily read the previous revisions of the jcr:blob
property and then extract the blobId, which can then be deleted via
direct invocation of the GarbageCollectableBlobStore API. For Segment I
am not sure how to easily read previous revisions of a given NodeState.

Chetan Mehrotra


Re: working lucene fulltext index

2015-03-10 Thread Torgeir Veimo
Thank you! This example helped me iron out the errors in my index configuration!

It would be good to have a bit more example code online for these things.

On 6 March 2015 at 04:16, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
 Hi Torgeir,

 Sorry for the delay here, as I got stuck with other issues. I tried your
 approach and it looks like you had a typo in your index definition:

 -  .setProperty("isRegExp", true)
 +  .setProperty("isRegexp", true)
    .setProperty("nodeScopeIndex", true);

 I tried to create a standalone example which you can try, to see the
 lucene index in action [1].

 Let me know if you still face any issues.

 Chetan Mehrotra
 [1] https://gist.github.com/chetanmeh/c1ccc4fa588ed1af467b
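
 For reference, a rough sketch of the corrected definition in the Oak
 lucene index format (the 'ka:asset' rule name is taken from the log
 output below; the exact wiring is an assumption, not the code from the
 gist):

     import org.apache.jackrabbit.oak.api.Type;
     import org.apache.jackrabbit.oak.spi.state.NodeBuilder;

     // 'root' is assumed to be a NodeBuilder for the repository root
     NodeBuilder index = root.child("oak:index").child("lucene");
     index.setProperty("jcr:primaryType", "oak:QueryIndexDefinition", Type.NAME);
     index.setProperty("type", "lucene");
     index.setProperty("async", "async");
     index.child("indexRules").child("ka:asset").child("properties")
          .child("allProps")
          .setProperty("name", ".*")
          .setProperty("isRegexp", true)   // note 'isRegexp', not 'isRegExp' -- the typo fixed above
          .setProperty("nodeScopeIndex", true);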


 On Wed, Feb 25, 2015 at 7:26 PM, Torgeir Veimo torgeir.ve...@gmail.com 
 wrote:
 Sorted out my lucene version issues, so I'm not getting that exception any
 more, but still not getting any query results. Still seeing multiple
 of these in the logs:

 23:55:14,288 TRACE lucene.IndexDefinition.collectIndexRules() - line
 519 [0:0:0:0:0:0:0:1] - Found rule 'IndexRule: ka:asset' for NodeType
 'ka:asset'
 23:55:14,288 TRACE lucene.IndexDefinition.collectIndexRules() - line
 535 [0:0:0:0:0:0:0:1] - Registering rule 'IndexRule: ka:asset' for
 name 'ka:asset'

 On 25 February 2015 at 16:49, Torgeir Veimo torgeir.ve...@gmail.com wrote:
 I tried without the async: async property on the lucene index, on an
 empty repository, and am seeing an exception.

 Any idea on how I can try to find the cause of this?

 I assume if I tried to run with the lucene index on disk instead of in
 the segment store, I might avoid this, but the documentation doesn't
 really outline how to do this in much detail.

 16:44:09,437 INFO  index.IndexUpdate.enter() - line 110 [] -
 Reindexing will be performed for following indexes:
 [/oak:index/ka:owner, /oak:index/positionref, /oak:index/targetId,
 /oak:index/uuid, /oak:index/ka:id, /oak:index/mail,
 /oak:index/ka:tags, /oak:index/active, /oak:index/ka:applicationState,
 /oak:index/parentTargetId, /oak:index/reference, /oak:index/ka:uid,
 /oak:index/ka:rememberme, /oak:index/ka:state, /oak:index/ka:serial,
 /oak:index/ka:assetType, /oak:index/lucene, /oak:index/ka:series,
 /oak:index/ka:principal, /oak:index/affiliation, /oak:index/ka:expire,
 /oak:index/companyref, /oak:index/title, /oak:index/lastCommentDate,
 /oak:index/ka:subscriptionFrequency, /oak:index/nodetype]
 16:44:09,547 WARN  support.AbstractApplicationContext.refresh() - line
 486 [] - Exception encountered during context initialization -
 cancelling refresh attempt
 org.springframework.beans.factory.BeanCreationException: Error
 creating bean with name 'assetOwnerPermission': Injection of autowired
 dependencies failed; nested exception is
 org.springframework.beans.factory.BeanCreationException: Could not
 autowire field: no.karriere.content.dao.AssetRepository
 no.karriere.content.authorization.permissions.AbstractPermission.assetRepository;
 nested exception is
 org.springframework.beans.factory.BeanCreationException: Error
 creating bean with name 'assetRepository': Injection of autowired
 dependencies failed; nested exception is
 org.springframework.beans.factory.BeanCreationException: Could not
 autowire field: no.karriere.content.dao.jcr.MediaHelper
 no.karriere.content.dao.jcr.JcrAssetRepository.mediaHelper; nested
 exception is org.springframework.beans.factory.BeanCreationException:
 Error creating bean with name 'mediaHelper': Injection of autowired
 dependencies failed; nested exception is
 org.springframework.beans.factory.BeanCreationException: Could not
 autowire field:
 no.karriere.content.services.repository.RepositoryService
 no.karriere.content.dao.jcr.MediaHelper.repositoryService; nested
 exception is org.springframework.beans.factory.BeanCreationException:
 Error creating bean with name 'repositoryService': Injection of
 autowired dependencies failed; nested exception is
 org.springframework.beans.factory.BeanCreationException: Could not
 autowire field: javax.jcr.Repository
 no.karriere.content.services.repository.RepositoryService.oakRepository;
 nested exception is
 org.springframework.beans.factory.BeanCreationException: Error
 creating bean with name 'getRepository' defined in class path resource
 [no/karriere/content/dao/jcr/repository/RepositoryConfiguration.class]:
 Instantiation of bean failed; nested exception is
 org.springframework.beans.factory.BeanDefinitionStoreException:
 Factory method [public javax.jcr.Repository
 no.karriere.content.dao.jcr.repository.RepositoryConfiguration.getRepository()
 throws no.karriere.content.exception.ContentException] threw
 exception; nested exception is java.lang.AbstractMethodError:
 org.apache.lucene.store.IndexOutput.getChecksum()J
 at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor.postProcessPropertyValues(AutowiredAnnotationBeanPostProcessor.java:298)
 ...
 

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Marth
Could the Lucene indexer explicitly track these files (e.g. as a property in
the index definition)? And also take care of removing them? (The latter part
assumes that the same index file is not identical across various definitions.)

 On 10 Mar 2015, at 12:18, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
 
 On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote:
 The problem is that you don't even have a list of all previous revisions of
 the root node state. Revisions are created on the fly and kept as needed.
 
 Hmm, yup. Then we would need to think of some other approach to know
 all the blobIds referred to by the Lucene index files.
 
 
 Chetan Mehrotra



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
On Tue, Mar 10, 2015 at 3:33 PM, Michael Dürig mdue...@apache.org wrote:
 SegmentMK doesn't even have the concept of a previous revision of a
 NodeState.

Yes, that needs to be thought about. I want to read all previous revisions
of the path /oak:index/lucene/:data. For Segment I believe I would need
to start at the root references for all previous revisions and then read
along the required path from those root segments to collect previous
revisions.

Would that work?

Chetan Mehrotra


Re: Parallelize text extraction from binary fields

2015-03-10 Thread Chetan Mehrotra
 Is Oak already single instance when it comes to the identification and 
 storage of binaries ?

Yes. Oak uses content-addressable storage for binaries.

 Are the existing TextExtractors also single instance ?

No. If the same binary is referenced in multiple places, then text extraction
would be performed for each such reference of that binary.

 By Single instance I mean, 1 copy of the binary and its token stream in the
 repository regardless of how many times it's referenced.

So, based on the above, there would be multiple token streams.

What's the approach you are thinking of ... and how would it benefit from
a 'single instance' based design?
Chetan Mehrotra


On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston i...@tfd.co.uk wrote:
 Hi,
 Is Oak already single instance when it comes to the identification and
 storage of binaries ?
 Are the existing TextExtractors also single instance ?
 By Single instance I mean, 1 copy of the binary and its token stream in the
 repository regardless of how many times it's referenced.

 Best Regards
 Ian

 On 10 March 2015 at 07:05, Chetan Mehrotra chetan.mehro...@gmail.com
 wrote:

 LuceneIndexEditor currently extracts the binary contents via Tika in the
 same thread that is used for processing the commit. Such an approach
 does not make good use of a multi-processor system, specifically when the
 index is being built up as part of a migration process.

 Looking at JR2 I see LazyTextExtractor [1], which I think would help
 parallelize text extraction.

 Would it make sense to bring this to Oak? Would that help in improving
 performance?

 Chetan Mehrotra
 [1]
 https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Dürig



On 10.3.15 11:32, Chetan Mehrotra wrote:

On Tue, Mar 10, 2015 at 3:33 PM, Michael Dürig mdue...@apache.org wrote:

SegmentMK doesn't even have the concept of a previous revision of a
NodeState.


Yes, that needs to be thought about. I want to read all previous revisions
of the path /oak:index/lucene/:data. For Segment I believe I would need
to start at the root references for all previous revisions and then read
along the required path from those root segments to collect previous
revisions.


The problem is that you don't even have a list of all previous revisions 
of the root node state. Revisions are created on the fly and kept as 
needed.


Michael



Would that work?

Chetan Mehrotra



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote:
 The problem is that you don't even have a list of all previous revisions of
 the root node state. Revisions are created on the fly and kept as needed.

Hmm, yup. Then we would need to think of some other approach to know
all the blobIds referred to by the Lucene index files.


Chetan Mehrotra


Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
That's one approach we can think about. Thinking further, with Lucene's
design of immutable files things become simpler (ignoring the reindex
case). In normal usage Lucene never reuses a file name and never
modifies any existing file. So we would not have to worry about
reading older revisions. We only need to keep track of deleted files
and the blobs referred to by them.

So once a file node is marked as deleted we can possibly have a diff
performed (we already do that to detect when the index has changed) and
collect blobIds from deleted file nodes in the previous state. Those can
be safely deleted *after* some time (allowing other cluster nodes to
pick up the change).
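
A rough sketch of that diff-based collection using Oak's NodeStateDiff
callback (the property name and blob-id extraction below are assumptions
for illustration, not the actual implementation):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.jackrabbit.oak.api.Type;
    import org.apache.jackrabbit.oak.spi.state.DefaultNodeStateDiff;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    // Hypothetical sketch: collect blob ids from index file nodes deleted
    // between two revisions of the :data node.
    class DeletedIndexFileCollector extends DefaultNodeStateDiff {
        final List<String> deletedBlobIds = new ArrayList<String>();

        @Override
        public boolean childNodeDeleted(String name, NodeState before) {
            if (before.hasProperty("jcr:data")) {
                // assumes the blob exposes a content identity usable as a blobId
                deletedBlobIds.add(before.getProperty("jcr:data")
                        .getValue(Type.BINARY).getContentIdentity());
            }
            return true; // continue diffing other children
        }
    }

    // usage sketch:
    // dataAfter.compareAgainstBaseState(dataBefore, new DeletedIndexFileCollector());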
Chetan Mehrotra


On Tue, Mar 10, 2015 at 4:53 PM, Michael Marth mma...@adobe.com wrote:
 Could the Lucene indexer explicitly track these files (e.g. as a property in
 the index definition)? And also take care of removing them? (The latter part
 assumes that the same index file is not identical across various definitions.)

 On 10 Mar 2015, at 12:18, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

 On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote:
 The problem is that you don't even have a list of all previous revisions of
 the root node state. Revisions are created on the fly and kept as needed.

 Hmm, yup. Then we would need to think of some other approach to know
 all the blobIds referred to by the Lucene index files.


 Chetan Mehrotra



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Dürig



On 10.3.15 10:49, Chetan Mehrotra wrote:

  For Segment I am
not sure how to easily read previous revisions of a given NodeState.


SegmentMK doesn't even have the concept of a previous revision of a 
NodeState.


Michael


[RESULT][VOTE] Release Apache Jackrabbit Oak 1.0.12

2015-03-10 Thread Marcel Reutegger
Hi,

the vote passes as follows:

+1 Michael Dürig
+1 Amit Jain
+1 Alex Parvulescu
+1 Davide Giannella
+1 Julian Reschke
+1 Thomas Mueller

I'll push the release out.

Thomas, your vote was a bit unclear. Your first statement was
a +1 vote. Later you voiced concerns and suggested not releasing
the candidate as 1.0.12, though you didn't want it to be
interpreted as a -1.

Maybe you didn't want to vote -1 because you thought it would be a veto?
This is a majority vote, and the release candidate only fails if we
don't reach a majority of +1s or don't get sufficient +1s (at least
three are required).

Regards
 Marcel

On 04/03/15 11:07, Marcel Reutegger mreut...@adobe.com wrote:

A candidate for the Jackrabbit Oak 1.0.12 release is available at:

https://dist.apache.org/repos/dist/dev/jackrabbit/oak/1.0.12/

The release candidate is a zip archive of the sources in:


https://svn.apache.org/repos/asf/jackrabbit/oak/tags/jackrabbit-oak-1.0.12/

The SHA1 checksum of the archive is
c442265596bb303042b4d3b2e218201d850ec153.

A staged Maven repository is available for review at:

https://repository.apache.org/

The command for running automated checks against this release candidate is:

$ sh check-release.sh oak 1.0.12 c442265596bb303042b4d3b2e218201d850ec153

Please vote on releasing this package as Apache Jackrabbit Oak 1.0.12.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Jackrabbit PMC votes are cast.

[ ] +1 Release this package as Apache Jackrabbit Oak 1.0.12
[ ] -1 Do not release this package because...




Regards
 Marcel




Re: Parallelize text extraction from binary fields

2015-03-10 Thread Ian Boston
Hi,

On 10 March 2015 at 09:52, Chetan Mehrotra chetan.mehro...@gmail.com
wrote:

  Is Oak already single instance when it comes to the identification and
 storage of binaries ?

 Yes. Oak uses content-addressable storage for binaries.

  Are the existing TextExtractors also single instance ?

 No. If the same binary is referenced in multiple places, then text extraction
 would be performed for each such reference of that binary.

  By Single instance I mean, 1 copy of the binary and its token stream in
 the repository regardless of how many times it's referenced.

 So, based on the above, there would be multiple token streams.

 What's the approach you are thinking of ... and how would it benefit from
 a 'single instance' based design?


Tokenize once, and store the token stream with the binary so it can be
re-used rather than re-processed.
Obviously, if the content of the binary changes and it's not immutable, the
token stream has to be re-processed.
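
A minimal sketch of that idea, assuming the binary's content identity
(Oak binaries being content-addressed) can serve as the cache key; the
class and method names are illustrative, not an existing API:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch: extract each distinct binary once and re-use
    // the result for every reference with the same content identity.
    class ExtractedTextCache {
        private final ConcurrentMap<String, String> cache =
                new ConcurrentHashMap<String, String>();

        String getOrExtract(String contentIdentity, Extractor extractor) {
            String text = cache.get(contentIdentity);
            if (text == null) {
                text = extractor.extract();          // expensive Tika pass
                cache.putIfAbsent(contentIdentity, text);
            }
            return text;
        }

        interface Extractor { String extract(); }
    }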
Best Regards
Ian



 Chetan Mehrotra


 On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston i...@tfd.co.uk wrote:
  Hi,
  Is Oak already single instance when it comes to the identification and
  storage of binaries ?
  Are the existing TextExtractors also single instance ?
  By Single instance I mean, 1 copy of the binary and its token stream in the
  repository regardless of how many times it's referenced.
 
  Best Regards
  Ian
 
  On 10 March 2015 at 07:05, Chetan Mehrotra chetan.mehro...@gmail.com
  wrote:
 
  LuceneIndexEditor currently extracts the binary contents via Tika in the
  same thread that is used for processing the commit. Such an approach
  does not make good use of a multi-processor system, specifically when the
  index is being built up as part of a migration process.

  Looking at JR2 I see LazyTextExtractor [1], which I think would help
  parallelize text extraction.

  Would it make sense to bring this to Oak? Would that help in improving
  performance?
 
  Chetan Mehrotra
  [1]
 
 https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
 



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Thomas Mueller
Hi,

I think removing binaries directly without going through the GC logic is
dangerous, because we can't be sure there are no other references. There
is one exception: if each file is guaranteed to be unique. For that,
we could for example append a unique UUID to each file. The Lucene file
system implementation would need to be changed for that (write the UUID,
but ignore it when reading and when reporting the file size).
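
A rough sketch of that uniqueness trick, outside the actual Lucene
Directory layer (a real change would wrap Lucene's IndexOutput/IndexInput;
the names here are illustrative):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.UUID;

    // Hypothetical sketch: append a UUID to every written index file so
    // byte-identical files still map to distinct blobs in the
    // content-addressed DataStore; readers must ignore the suffix.
    class UuidSuffixedFiles {
        static final int UUID_SUFFIX_LEN = 36; // textual UUID length

        static void writeWithUuidSuffix(Path file, byte[] content) throws IOException {
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write(content);
                out.write(UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
            }
        }

        static long logicalLength(Path file) throws IOException {
            return Files.size(file) - UUID_SUFFIX_LEN; // hide the suffix from readers
        }
    }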

Even in that case, there is still a risk, for example if the binary
_reference_ is copied, or if an old revision is accessed. How do we ensure
this does not happen?

Regards,
Thomas


On 10/03/15 07:46, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

Hi Team,

With the storing of Lucene index files within the DataStore, our usage pattern
of the DataStore has changed between JR2 and Oak.

With JR2 the writes were mostly application based, i.e. if the application
stored a pdf/image file then that would be stored in the DataStore. JR2 by
default would not write its own data to the DataStore. Further, in deployments
where a large amount of binary content is present, systems tend to
share the DataStore to avoid duplication of storage. In such cases
running Blob GC is a non-trivial task, as it involves a manual step and
coordination across multiple deployments. Due to this, systems tend to
run GC less frequently.

Now with Oak, apart from the application, the Oak system itself *actively*
uses the DataStore to store the index files for Lucene, and there the
churn might be much higher, i.e. the frequency of creation and deletion of
index files is a lot higher. This would accelerate the rate of garbage
generation and thus put a lot more pressure on DataStore storage
requirements.

Any thoughts on how to avoid/reduce the requirement to increase the
frequency of Blob GC?

One possible way would be to provide a special cleanup tool which can
look for such old Lucene index files and delete them directly without
going through the full-fledged MarkAndSweep logic.

Thoughts?

Chetan Mehrotra



Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
Hi Team,

With the storing of Lucene index files within the DataStore, our usage pattern
of the DataStore has changed between JR2 and Oak.

With JR2 the writes were mostly application based, i.e. if the application
stored a pdf/image file then that would be stored in the DataStore. JR2 by
default would not write its own data to the DataStore. Further, in deployments
where a large amount of binary content is present, systems tend to
share the DataStore to avoid duplication of storage. In such cases
running Blob GC is a non-trivial task, as it involves a manual step and
coordination across multiple deployments. Due to this, systems tend to
run GC less frequently.

Now with Oak, apart from the application, the Oak system itself *actively*
uses the DataStore to store the index files for Lucene, and there the
churn might be much higher, i.e. the frequency of creation and deletion of
index files is a lot higher. This would accelerate the rate of garbage
generation and thus put a lot more pressure on DataStore storage
requirements.

Any thoughts on how to avoid/reduce the requirement to increase the
frequency of Blob GC?

One possible way would be to provide a special cleanup tool which can
look for such old Lucene index files and delete them directly without
going through the full-fledged MarkAndSweep logic.
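
A rough sketch of what such direct deletion could look like via the
GarbageCollectableBlobStore SPI, assuming the chunk ids of stale index
files have already been collected somehow (the open question in this
thread); the method wiring is an assumption, not the proposed tool:

    import java.util.List;
    import org.apache.jackrabbit.oak.spi.blob.GarbageCollectableBlobStore;

    // Hypothetical sketch of the cleanup tool: delete the chunks of
    // known-stale Lucene index blobs directly, bypassing full MarkAndSweep.
    void cleanupStaleIndexBlobs(GarbageCollectableBlobStore store,
                                List<String> staleChunkIds,
                                long maxLastModifiedTime) throws Exception {
        // the time bound leaves recently modified chunks alone, so
        // concurrent writers are not affected
        store.deleteChunks(staleChunkIds, maxLastModifiedTime);
    }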

Thoughts?

Chetan Mehrotra


Parallelize text extraction from binary fields

2015-03-10 Thread Chetan Mehrotra
LuceneIndexEditor currently extracts the binary contents via Tika in the
same thread that is used for processing the commit. Such an approach
does not make good use of a multi-processor system, specifically when the
index is being built up as part of a migration process.

Looking at JR2 I see LazyTextExtractor [1], which I think would help
parallelize text extraction.

Would it make sense to bring this to Oak? Would that help in improving
performance?
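
A minimal sketch of the kind of parallel extraction this suggests, assuming
a Tika AutoDetectParser and a fixed thread pool; the wiring is illustrative,
not the JR2 or Oak implementation:

    import java.io.InputStream;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    // Hypothetical sketch: hand each binary off to a pool instead of
    // extracting text in the thread that processes the commit.
    class ParallelTextExtractor {
        private final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        private final AutoDetectParser parser = new AutoDetectParser();

        Future<String> extractAsync(final InputStream binary) {
            return pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
                    parser.parse(binary, handler, new Metadata());
                    return handler.toString();
                }
            });
        }
    }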

Chetan Mehrotra
[1] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java


Re: Parallelize text extraction from binary fields

2015-03-10 Thread Ian Boston
Hi,
Is Oak already single instance when it comes to the identification and
storage of binaries ?
Are the existing TextExtractors also single instance ?
By Single instance I mean, 1 copy of the binary and its token stream in the
repository regardless of how many times it's referenced.

Best Regards
Ian

On 10 March 2015 at 07:05, Chetan Mehrotra chetan.mehro...@gmail.com
wrote:

 LuceneIndexEditor currently extracts the binary contents via Tika in the
 same thread that is used for processing the commit. Such an approach
 does not make good use of a multi-processor system, specifically when the
 index is being built up as part of a migration process.

 Looking at JR2 I see LazyTextExtractor [1], which I think would help
 parallelize text extraction.

 Would it make sense to bring this to Oak? Would that help in improving
 performance?

 Chetan Mehrotra
 [1]
 https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Marth
Hi Chetan,

I like the idea.
But I wonder: how do you envision that this new index cleanup would locate 
indexes in the content-addressed DS?

Michael

 On 10 Mar 2015, at 07:46, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
 
 Hi Team,
 
 With the storing of Lucene index files within the DataStore, our usage pattern
 of the DataStore has changed between JR2 and Oak.
 
 With JR2 the writes were mostly application based, i.e. if the application
 stored a pdf/image file then that would be stored in the DataStore. JR2 by
 default would not write its own data to the DataStore. Further, in deployments
 where a large amount of binary content is present, systems tend to
 share the DataStore to avoid duplication of storage. In such cases
 running Blob GC is a non-trivial task, as it involves a manual step and
 coordination across multiple deployments. Due to this, systems tend to
 run GC less frequently.
 
 Now with Oak, apart from the application, the Oak system itself *actively*
 uses the DataStore to store the index files for Lucene, and there the
 churn might be much higher, i.e. the frequency of creation and deletion of
 index files is a lot higher. This would accelerate the rate of garbage
 generation and thus put a lot more pressure on DataStore storage
 requirements.
 
 Any thoughts on how to avoid/reduce the requirement to increase the
 frequency of Blob GC?
 
 One possible way would be to provide a special cleanup tool which can
 look for such old Lucene index files and delete them directly without
 going through the full-fledged MarkAndSweep logic.
 
 Thoughts?
 
 Chetan Mehrotra