Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
On Tue, Mar 10, 2015 at 1:50 PM, Michael Marth mma...@adobe.com wrote: But I wonder: how do you envision that this new index cleanup would locate indexes in the content-addressed DS? That's a bit tricky. I have a rough idea of how to approach it, but it would require more thought. The approach I am thinking of is: 1. Have an index on oak:QueryIndexDefinition 2. Query for all index definition nodes with type=lucene 3. Get the ':data' node and then perform the listing. Each child node is a Lucene index file representation. For Mongo I can easily read the previous revisions of the jcr:blob property and then extract the blobId, which can then be deleted via direct invocation of the GarbageCollectableBlobStore API. For Segment I am not sure how to easily read previous revisions of a given NodeState. Chetan Mehrotra
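The three-step listing Chetan outlines can be sketched schematically. This is a hypothetical, self-contained illustration, not Oak's actual API: nested maps stand in for the repository tree, each index definition of type=lucene has a ':data' child, and each child of ':data' carries a blobId property.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: nested maps stand in for the repository tree.
public class LuceneIndexBlobLister {

    // Steps 2 and 3: visit each index definition, keep only type=lucene,
    // descend into ':data' and collect the blobId of every file child.
    static List<String> collectBlobIds(Map<String, Map<String, Object>> indexDefs) {
        List<String> blobIds = new ArrayList<>();
        for (Map<String, Object> def : indexDefs.values()) {
            if (!"lucene".equals(def.get("type"))) {
                continue; // property indexes etc. keep no files in the DataStore
            }
            @SuppressWarnings("unchecked")
            Map<String, String> data = (Map<String, String>) def.get(":data");
            if (data != null) {
                blobIds.addAll(data.values()); // one child node per Lucene index file
            }
        }
        return blobIds;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> oakIndex = new LinkedHashMap<>();

        Map<String, Object> lucene = new LinkedHashMap<>();
        lucene.put("type", "lucene");
        Map<String, String> data = new LinkedHashMap<>();
        data.put("_0.cfs", "blob-aaa");     // index file node -> blobId property
        data.put("segments_1", "blob-bbb");
        lucene.put(":data", data);
        oakIndex.put("lucene", lucene);

        Map<String, Object> uuid = new LinkedHashMap<>();
        uuid.put("type", "property");       // must be skipped by the listing
        oakIndex.put("uuid", uuid);

        System.out.println(collectBlobIds(oakIndex)); // [blob-aaa, blob-bbb]
    }
}
```

The collected blobIds would then be handed to GarbageCollectableBlobStore for deletion; the hard part, as the thread goes on to discuss, is doing the same for previous revisions of ':data'.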
Re: working lucene fulltext index
Thank you! This example helped me iron out the errors in my index configuration! It would be good to have a bit more example code online for these things. On 6 March 2015 at 04:16, Chetan Mehrotra chetan.mehro...@gmail.com wrote: Hi Torgeir, Sorry for the delay here as I got stuck with other issues. I tried your approach and it looks like you had a typo in your index definition: - .setProperty(isRegExp, true) + .setProperty(isRegexp, true) .setProperty(nodeScopeIndex, true); I created a standalone example which you can try to see the Lucene index in action [1]. Let me know if you still face any issues. Chetan Mehrotra [1] https://gist.github.com/chetanmeh/c1ccc4fa588ed1af467b On Wed, Feb 25, 2015 at 7:26 PM, Torgeir Veimo torgeir.ve...@gmail.com wrote: Sorted out my Lucene version issues, so I am not getting that exception any more, but still not getting any query results. Still seeing multiple of these in the logs: 23:55:14,288 TRACE lucene.IndexDefinition.collectIndexRules() - line 519 [0:0:0:0:0:0:0:1] - Found rule 'IndexRule: ka:asset' for NodeType 'ka:asset' 23:55:14,288 TRACE lucene.IndexDefinition.collectIndexRules() - line 535 [0:0:0:0:0:0:0:1] - Registering rule 'IndexRule: ka:asset' for name 'ka:asset' On 25 February 2015 at 16:49, Torgeir Veimo torgeir.ve...@gmail.com wrote: I tried without the async: async property on the lucene index, on an empty repository, and am seeing an exception. Any idea how I can try to find the cause of this? I assume that if I ran with the lucene index on disk instead of in the segment store, I might avoid this, but the documentation doesn't really outline how to do this in much detail.
16:44:09,437 INFO index.IndexUpdate.enter() - line 110 [] - Reindexing will be performed for following indexes: [/oak:index/ka:owner, /oak:index/positionref, /oak:index/targetId, /oak:index/uuid, /oak:index/ka:id, /oak:index/mail, /oak:index/ka:tags, /oak:index/active, /oak:index/ka:applicationState, /oak:index/parentTargetId, /oak:index/reference, /oak:index/ka:uid, /oak:index/ka:rememberme, /oak:index/ka:state, /oak:index/ka:serial, /oak:index/ka:assetType, /oak:index/lucene, /oak:index/ka:series, /oak:index/ka:principal, /oak:index/affiliation, /oak:index/ka:expire, /oak:index/companyref, /oak:index/title, /oak:index/lastCommentDate, /oak:index/ka:subscriptionFrequency, /oak:index/nodetype]
16:44:09,547 WARN support.AbstractApplicationContext.refresh() - line 486 [] - Exception encountered during context initialization - cancelling refresh attempt
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'assetOwnerPermission': Injection of autowired dependencies failed;
  nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: no.karriere.content.dao.AssetRepository no.karriere.content.authorization.permissions.AbstractPermission.assetRepository;
  nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'assetRepository': Injection of autowired dependencies failed;
  nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: no.karriere.content.dao.jcr.MediaHelper no.karriere.content.dao.jcr.JcrAssetRepository.mediaHelper;
  nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'mediaHelper': Injection of autowired dependencies failed;
  nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: no.karriere.content.services.repository.RepositoryService no.karriere.content.dao.jcr.MediaHelper.repositoryService;
  nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'repositoryService': Injection of autowired dependencies failed;
  nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: javax.jcr.Repository no.karriere.content.services.repository.RepositoryService.oakRepository;
  nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'getRepository' defined in class path resource [no/karriere/content/dao/jcr/repository/RepositoryConfiguration.class]: Instantiation of bean failed;
  nested exception is org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public javax.jcr.Repository no.karriere.content.dao.jcr.repository.RepositoryConfiguration.getRepository() throws no.karriere.content.exception.ContentException] threw exception;
  nested exception is java.lang.AbstractMethodError: org.apache.lucene.store.IndexOutput.getChecksum()J
 at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor.postProcessPropertyValues(AutowiredAnnotationBeanPostProcessor.java:298) at
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
Could the Lucene indexer explicitly track these files (e.g. as a property in the index definition)? And also take care of removing them? (The latter part assumes that the same index file is not shared across various definitions.) On 10 Mar 2015, at 12:18, Chetan Mehrotra chetan.mehro...@gmail.com wrote: On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote: The problem is that you don't even have a list of all previous revisions of the root node state. Revisions are created on the fly and kept as needed. hmm yup. Then we would need to think of some other approach to know all the blobIds referred to by the Lucene index files Chetan Mehrotra
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
On Tue, Mar 10, 2015 at 3:33 PM, Michael Dürig mdue...@apache.org wrote: SegmentMK doesn't even have the concept of a previous revision of a NodeState. Yes, that is to be thought about. I want to read all previous revisions of the path /oak:index/lucene/:data. For Segment I believe I would need to start at the root references for all previous revisions and then read along the required path from those root segments to collect the previous revisions. Would that work? Chetan Mehrotra
Re: Parallelize text extraction from binary fields
Is Oak already single instance when it comes to the identification and storage of binaries? Yes. Oak uses content-addressable storage for binaries. Are the existing TextExtractors also single instance? No. If the same binary is referenced at multiple places then text extraction would be performed for each such reference of that binary. By Single instance I mean, 1 copy of the binary and its token stream in the repository regardless of how many times it's referenced. So based on the above, the token stream would be multiple. What's the approach you are thinking of ... that would benefit from a 'Single instance' based design? Chetan Mehrotra On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston i...@tfd.co.uk wrote: Hi, Is Oak already single instance when it comes to the identification and storage of binaries ? Are the existing TextExtractors also single instance ? By Single instance I mean, 1 copy of the binary and its token stream in the repository regardless of how many times its referenced. Best Regards Ian On 10 March 2015 at 07:05, Chetan Mehrotra chetan.mehro...@gmail.com wrote: LuceneIndexEditor currently extract the binary contents via Tika in same thread which is used for processing the commit. Such an approach does not make good use of multi processor system specifically when index is being built up as part of migration process. Looking at JR2 I see LazyTextExtractor [1] which I think would help in parallelize text extraction. Would it make sense to bring this to Oak. Would that help in improving performance? Chetan Mehrotra [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
On 10.3.15 11:32 , Chetan Mehrotra wrote: On Tue, Mar 10, 2015 at 3:33 PM, Michael Dürig mdue...@apache.org wrote: SegmentMK doesn't even have the concept of a previous revision of a NodeState. Yes that is to be thought about. I want to read all previous revision for path /oak:index/lucene/:data. For segment I believe I would need to start at root references for all previous revisions and then read along the required path from those root segments to collect previous revisions. The problem is that you don't even have a list of all previous revisions of the root node state. Revisions are created on the fly and kept as needed. Michael Would that work? Chetan Mehrotra
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote: The problem is that you don't even have a list of all previous revisions of the root node state. Revisions are created on the fly and kept as needed. hmm yup. Then we would need to think of some other approach to know all the blobIds referred to by the Lucene index files. Chetan Mehrotra
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
That's one approach we can think about. Thinking further, with Lucene's design of immutable files things become simpler (ignoring the reindex case). In normal usage Lucene never reuses a file name and never modifies any existing file. So we would not have to worry about reading older revisions. We only need to keep track of deleted files and the blobs referred to by them. So once a file node is marked as deleted we can have a diff performed (we already do this to detect when the index has changed) and collect the blobIds from the deleted file nodes in the previous state. Those can be safely deleted *after* some time (allowing other cluster nodes to pick up the change). Chetan Mehrotra On Tue, Mar 10, 2015 at 4:53 PM, Michael Marth mma...@adobe.com wrote: Could the Lucene indexer explicitly track these files (e.g. as a property in the index definition)? And also take care of removing them? (the latter part is assuming that the same index file is not identical across various definitions) On 10 Mar 2015, at 12:18, Chetan Mehrotra chetan.mehro...@gmail.com wrote: On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote: The problem is that you don't even have a list of all previous revisions of the root node state. Revisions are created on the fly and kept as needed. hmm yup. Then we would need to think of some other approach to know all the blobId referred to by the Lucene Index files Chetan Mehrotra
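The diff-based idea Chetan describes can be sketched as follows. This is an assumed design, not Oak's implementation: the ':data' node is modelled as a fileName-to-blobId map in its before and after states, and a deleted blob only becomes eligible for removal after a configurable grace period.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assumed design, not Oak's implementation. Because Lucene never reuses a
// file name or modifies an existing file, any name present before a commit
// but absent after it identifies a deleted index file whose blob is garbage.
public class DeletedIndexFileTracker {

    // Diff the two states of ':data' and record when each blob became garbage.
    static Map<String, Long> collectDeletedBlobs(
            Map<String, String> before, Map<String, String> after, long now) {
        Map<String, Long> deleted = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : before.entrySet()) {
            if (!after.containsKey(e.getKey())) {
                deleted.put(e.getValue(), now); // blobId -> deletion timestamp
            }
        }
        return deleted;
    }

    // Only blobs older than the grace period are handed to the blob store,
    // giving other cluster nodes time to pick up the new index state.
    static List<String> safeToDelete(Map<String, Long> deleted, long now, long graceMillis) {
        List<String> safe = new ArrayList<>();
        for (Map.Entry<String, Long> e : deleted.entrySet()) {
            if (now - e.getValue() >= graceMillis) {
                safe.add(e.getKey());
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        Map<String, String> before = new LinkedHashMap<>();
        before.put("_0.cfs", "blob-aaa");
        before.put("segments_1", "blob-bbb");
        Map<String, String> after = new LinkedHashMap<>();
        after.put("segments_1", "blob-bbb"); // _0.cfs was merged away

        Map<String, Long> deleted = collectDeletedBlobs(before, after, 0L);
        System.out.println(safeToDelete(deleted, 5_000L, 60_000L));   // [] - too fresh
        System.out.println(safeToDelete(deleted, 120_000L, 60_000L)); // [blob-aaa]
    }
}
```

The grace period is what makes the direct deletion "safe *after* some time" in the sense of the message above; picking its length is a deployment decision.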
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
On 10.3.15 10:49 , Chetan Mehrotra wrote: For Segment I am not sure how to easily read previous revisions of given NodeState SegmentMK doesn't even have the concept of a previous revision of a NodeState. Michael
[RESULT][VOTE] Release Apache Jackrabbit Oak 1.0.12
Hi, the vote passes as follows: +1 Michael Dürig +1 Amit Jain +1 Alex Parvulescu +1 Davide Giannella +1 Julian Reschke +1 Thomas Mueller I'll push the release out. Thomas, your vote was a bit unclear. Your first statement was a +1 vote. Later you voiced concerns and suggested not to release the candidate as 1.0.12, though you didn't want it interpreted as a -1. Maybe you didn't want to vote -1 because you thought it was a veto? This is a majority vote, and the release candidate only fails if we don't reach a majority of +1s or don't get sufficient +1s (at least three are required). Regards Marcel On 04/03/15 11:07, Marcel Reutegger mreut...@adobe.com wrote: A candidate for the Jackrabbit Oak 1.0.12 release is available at: https://dist.apache.org/repos/dist/dev/jackrabbit/oak/1.0.12/ The release candidate is a zip archive of the sources in: https://svn.apache.org/repos/asf/jackrabbit/oak/tags/jackrabbit-oak-1.0.12 / The SHA1 checksum of the archive is c442265596bb303042b4d3b2e218201d850ec153. A staged Maven repository is available for review at: https://repository.apache.org/ The command for running automated checks against this release candidate is: $ sh check-release.sh oak 1.0.12 c442265596bb303042b4d3b2e218201d850ec153 Please vote on releasing this package as Apache Jackrabbit Oak 1.0.12. The vote is open for the next 72 hours and passes if a majority of at least three +1 Jackrabbit PMC votes are cast. [ ] +1 Release this package as Apache Jackrabbit Oak 1.0.12 [ ] -1 Do not release this package because... Regards Marcel
Re: Parallelize text extraction from binary fields
Hi, On 10 March 2015 at 09:52, Chetan Mehrotra chetan.mehro...@gmail.com wrote: Is Oak already single instance when it comes to the identification and storage of binaries ? Yes. Oak uses content addressable storage for binaries Are the existing TextExtractors also single instance ? No. If same binary is referred at multiple places then text extraction would be performed for each such reference of that binary By Single instance I mean, 1 copy of the binary and its token stream in the repository regardless of how many times its referenced. So based on above token stream would be multiple. What's the approach you are thinking ... and would benefit from 'Single instance' based design? Tokenize once, and store the token stream with the binary so it can be reused rather than re-processed. Obviously, if the content of the binary changes and it's not immutable, the token stream has to be re-processed. Best Regards Ian Chetan Mehrotra On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston i...@tfd.co.uk wrote: Hi, Is Oak already single instance when it comes to the identification and storage of binaries ? Are the existing TextExtractors also single instance ? By Single instance I mean, 1 copy of the binary and its token stream in the repository regardless of how many times its referenced. Best Regards Ian On 10 March 2015 at 07:05, Chetan Mehrotra chetan.mehro...@gmail.com wrote: LuceneIndexEditor currently extract the binary contents via Tika in same thread which is used for processing the commit. Such an approach does not make good use of multi processor system specifically when index is being built up as part of migration process. Looking at JR2 I see LazyTextExtractor [1] which I think would help in parallelize text extraction. Would it make sense to bring this to Oak. Would that help in improving performance? Chetan Mehrotra [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
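Ian's "tokenize once" suggestion could look roughly like the following, under the assumption that binaries are content addressed and immutable. For simplicity the sketch caches extracted text rather than a full Lucene token stream, and all names are illustrative, not Oak's API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch (not Oak's API): because binaries are content addressed
// and immutable, the extraction result can be keyed by a content hash so that
// every further reference to the same binary reuses it instead of re-running
// Tika. For simplicity this caches extracted text, not a token stream.
public class ExtractedTextCache {
    private final Map<String, String> cache = new HashMap<>();
    private int extractions = 0;

    String getOrExtract(byte[] binary, Function<byte[], String> extractor) {
        String id = contentId(binary); // identical bytes -> identical id
        String text = cache.get(id);
        if (text == null) {
            extractions++; // the expensive parse runs only on the first reference
            text = extractor.apply(binary);
            cache.put(id, text);
        }
        return text;
    }

    int extractionCount() {
        return extractions;
    }

    // Stand-in for the DataStore's content-addressed identifier.
    static String contentId(byte[] binary) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(binary);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        ExtractedTextCache cache = new ExtractedTextCache();
        Function<byte[], String> fakeTika =
                b -> new String(b, StandardCharsets.UTF_8).toUpperCase();
        byte[] pdf = "same binary".getBytes(StandardCharsets.UTF_8);

        cache.getOrExtract(pdf, fakeTika); // extracted
        cache.getOrExtract(pdf, fakeTika); // second reference: cache hit
        System.out.println(cache.extractionCount()); // 1
    }
}
```

As Ian notes, this only holds while the binary is immutable; a mutable binary would change its content id and naturally miss the cache, forcing re-extraction.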
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
Hi, I think removing binaries directly without going through the GC logic is dangerous, because we can't be sure whether there are other references. There is one exception: if each file is guaranteed to be unique. For that, we could for example append a unique UUID to each file. The Lucene file system implementation would need to be changed for that (write the UUID, but ignore it when reading and when computing the file size). Even in that case, there is still a risk, for example if the binary _reference_ is copied, or if an old revision is accessed. How do we ensure this does not happen? Regards, Thomas On 10/03/15 07:46, Chetan Mehrotra chetan.mehro...@gmail.com wrote: Hi Team, With storing of Lucene index files within DataStore our usage pattern of DataStore has changed between JR2 and Oak. With JR2 the writes were mostly application based i.e. if application stores a pdf/image file then that would be stored in DataStore. JR2 by default would not write stuff to DataStore. Further in deployment where large number of binary content is present then systems tend to share the DataStore to avoid duplication of storage. In such cases running Blob GC is a non trivial task as it involves a manual step and coordination across multiple deployments. Due to this systems tend to delay frequency of GC Now with Oak apart from application the Oak system itself *actively* uses the DataStore to store the index files for Lucene and there the churn might be much higher i.e. frequency of creation and deletion of index file is lot higher. This would accelerate the rate of garbage generation and thus put lot more pressure on the DataStore storage requirements. Any thoughts on how to avoid/reduce the requirement to increase the frequency of Blob GC? One possible way would be to provide a special cleanup tool which can look for such old Lucene index files and deletes them directly without going through the full fledged MarkAndSweep logic Thoughts? Chetan Mehrotra
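Thomas's uniqueness trick could be sketched as follows; this is illustrative only, and the real change would live in Oak's Lucene Directory implementation. A random 16-byte UUID is appended on write so that no two stored files can ever be byte-identical, readers skip the suffix, and the reported length excludes it.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

// Illustrative sketch of the uniqueness trick. A content-addressed DataStore
// deduplicates byte-identical binaries; appending a random UUID guarantees
// each index file maps to its own blob, so deleting that blob cannot break
// any other file that happened to have the same content.
public class UuidSuffixedFile {
    static final int SUFFIX_LEN = 16; // one UUID = two longs

    static byte[] write(byte[] content) {
        UUID uuid = UUID.randomUUID();
        ByteBuffer buf = ByteBuffer.allocate(content.length + SUFFIX_LEN);
        buf.put(content);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    static byte[] read(byte[] stored) {
        return Arrays.copyOf(stored, stored.length - SUFFIX_LEN); // drop the UUID
    }

    static int length(byte[] stored) {
        return stored.length - SUFFIX_LEN; // the logical length Lucene sees
    }

    public static void main(String[] args) {
        byte[] content = {1, 2, 3};
        byte[] a = write(content);
        byte[] b = write(content);
        System.out.println(Arrays.equals(read(a), content)); // true
        System.out.println(Arrays.equals(a, b)); // false: same content, distinct blobs
    }
}
```

Note that this only removes the deduplication hazard; Thomas's remaining concerns, a copied binary _reference_ or access through an old revision, are not addressed by making the files unique.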
Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
Hi Team, With Lucene index files now being stored in the DataStore, our usage pattern of the DataStore has changed between JR2 and Oak. With JR2 the writes were mostly application driven, i.e. if the application stored a pdf/image file then that would be stored in the DataStore; JR2 by default would not write its own data to the DataStore. Further, in deployments where a large amount of binary content is present, systems tend to share the DataStore to avoid duplication of storage. In such cases running Blob GC is a non-trivial task as it involves a manual step and coordination across multiple deployments. Due to this, systems tend to run GC less frequently. Now with Oak, apart from the application, the Oak system itself *actively* uses the DataStore to store the index files for Lucene, and there the churn might be much higher, i.e. the frequency of creation and deletion of index files is a lot higher. This would accelerate the rate of garbage generation and thus put a lot more pressure on the DataStore storage requirements. Any thoughts on how to avoid/reduce the requirement to increase the frequency of Blob GC? One possible way would be to provide a special cleanup tool which can look for such old Lucene index files and delete them directly without going through the full-fledged MarkAndSweep logic. Thoughts? Chetan Mehrotra
Parallelize text extraction from binary fields
LuceneIndexEditor currently extracts the binary contents via Tika in the same thread that is used for processing the commit. Such an approach does not make good use of multi-processor systems, specifically when the index is being built up as part of a migration process. Looking at JR2 I see LazyTextExtractor [1], which I think would help parallelize text extraction. Would it make sense to bring this to Oak? Would that help in improving performance? Chetan Mehrotra [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
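A minimal sketch of the kind of parallelization proposed, assuming a simple thread-pool design rather than LazyTextExtractor's actual mechanics; the extractor function is a stand-in for an expensive Tika parse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Assumed thread-pool design (not LazyTextExtractor's actual mechanics):
// each binary is handed to a pool so extraction runs off the commit thread,
// which only blocks when a result is actually needed for indexing.
public class ParallelTextExtraction {

    static List<String> extractAll(List<byte[]> binaries,
                                   Function<byte[], String> extractor,
                                   int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (byte[] binary : binaries) {
                // extraction starts immediately on a pool thread
                futures.add(pool.submit(() -> extractor.apply(binary)));
            }
            List<String> texts = new ArrayList<>();
            for (Future<String> f : futures) {
                texts.add(f.get()); // block per result, in submission order
            }
            return texts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a Tika parse of each binary.
        Function<byte[], String> fakeTika = b -> new String(b).toUpperCase();
        List<byte[]> binaries = Arrays.asList("foo".getBytes(), "bar".getBytes());
        System.out.println(extractAll(binaries, fakeTika, 4)); // [FOO, BAR]
    }
}
```

This helps most during reindexing/migration, where many binaries are queued at once; for a single small commit the pool overhead may outweigh the gain.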
Re: Parallelize text extraction from binary fields
Hi, Is Oak already single instance when it comes to the identification and storage of binaries? Are the existing TextExtractors also single instance? By Single instance I mean, 1 copy of the binary and its token stream in the repository regardless of how many times it's referenced. Best Regards Ian On 10 March 2015 at 07:05, Chetan Mehrotra chetan.mehro...@gmail.com wrote: LuceneIndexEditor currently extract the binary contents via Tika in same thread which is used for processing the commit. Such an approach does not make good use of multi processor system specifically when index is being built up as part of migration process. Looking at JR2 I see LazyTextExtractor [1] which I think would help in parallelize text extraction. Would it make sense to bring this to Oak. Would that help in improving performance? Chetan Mehrotra [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
Hi Chetan, I like the idea. But I wonder: how do you envision that this new index cleanup would locate indexes in the content-addressed DS? Michael On 10 Mar 2015, at 07:46, Chetan Mehrotra chetan.mehro...@gmail.com wrote: Hi Team, With storing of Lucene index files within DataStore our usage pattern of DataStore has changed between JR2 and Oak. With JR2 the writes were mostly application based i.e. if application stores a pdf/image file then that would be stored in DataStore. JR2 by default would not write stuff to DataStore. Further in deployment where large number of binary content is present then systems tend to share the DataStore to avoid duplication of storage. In such cases running Blob GC is a non trivial task as it involves a manual step and coordination across multiple deployments. Due to this systems tend to delay frequency of GC Now with Oak apart from application the Oak system itself *actively* uses the DataStore to store the index files for Lucene and there the churn might be much higher i.e. frequency of creation and deletion of index file is lot higher. This would accelerate the rate of garbage generation and thus put lot more pressure on the DataStore storage requirements. Any thoughts on how to avoid/reduce the requirement to increase the frequency of Blob GC? One possible way would be to provide a special cleanup tool which can look for such old Lucene index files and deletes them directly without going through the full fledged MarkAndSweep logic Thoughts? Chetan Mehrotra