[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140703#comment-14140703 ] Tim Smith commented on LUCENE-5940:
---
bq. Reindexing is part and parcel of search

I think the general goal should be that this is not the case, especially as search is adopted more and more as a replacement for systems that do not have these limitations/requirements (databases). Obviously this is an ambitious goal that can likely never be fully realized.

Reindexing also comes in two distinct flavors:
* Cold reindexing: rm -rf the index directory and re-feed.
** Requires 2x hardware or downtime.
* Live reindexing: change the configuration, restart the system, and re-feed all documents; the change is live once all documents have been reindexed.
** It is obviously a good idea to snapshot the previous index and configuration so you can restore them later on error.
** Minimal downtime (just a restart).
** Minimal search interruption (some queries related to the change may not match old documents until the reindex is complete).
** Old content can be replaced slowly over time to receive full functionality.

Live reindexing does have lots of pitfalls and may not always be viable. For instance, right now it is not possible to add offsets to an index using this approach: as soon as a new segment is merged with an old one, the offsets are blown away. I had filed a ticket for this; I'm not looking to reopen old wounds here, just pointing out an issue I had and worked around. Live reindexing is the goal I strive to achieve whenever reindexing is required (always with the caveat to back up your index first for safety). Some smart choices when designing the internal schema can reduce or eliminate many prospective issues here, even without any core changes to Lucene.

bq. it's strongly recommended that it be gathered into an intermediate store

These recommendations are always valid to make (and I will make them); however, this adds an entire new system to the mix, as well as new hardware, services, maintenance, security, etc. Also, given the scale and perhaps complexity of the documents, this may not even be enough, and it will still require a large amount of processing hardware to process those documents as fast as the index can index them in a reasonable amount of time (days versus months). In general, this is just extra complexity that will be dropped due to the higher price tag and maintenance cost. Then, when it finally is time to upgrade, the end-user expectation is "oh, we already have the data indexed, why can't we just use that with the new software?" This expectation is set by the fact that many customers/users are used to working with databases. I do not have this expectation myself; however, I have people downstream who do, and I need to do my best to accommodate them whether I like it or not.

Note: I'm not trying to force any requirements on the Lucene devs, or soliciting advice on specific functionality, just pointing out some real-world use cases I encounter related to the discussion here.

change index backwards compatibility policy.

Key: LUCENE-5940
URL: https://issues.apache.org/jira/browse/LUCENE-5940
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir

Currently, our index backwards compatibility is unmanageable. The length of time for which we must support old indexes is simply too long. Index back compat works like this: everyone wants it, but there are frequently bugs, and when push comes to shove, it's not a very sexy thing to work on or fix, so it's hard to get any help. Currently our back compat promise is just a broken promise, because we cannot actually guarantee it, for these reasons. I propose we scale back the length of time for which we must support old indexes.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
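The snapshot-before-live-reindex precaution mentioned above can be sketched with Lucene's SnapshotDeletionPolicy. This is a minimal sketch against the 5.x-style API, not code from the discussion; the index path is illustrative, and it assumes the index already has at least one commit (snapshot() throws IllegalStateException otherwise).

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.FSDirectory;

public class SnapshotBeforeReindex {
    public static void main(String[] args) throws Exception {
        // Wrap the normal deletion policy so a commit can be pinned.
        SnapshotDeletionPolicy snapshotter =
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setIndexDeletionPolicy(snapshotter);

        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
            // Pin the current commit; its files will not be deleted
            // while the live reindex rewrites documents on top of it.
            IndexCommit pinned = snapshotter.snapshot();
            try {
                // ... re-feed all documents here ...
            } finally {
                // Release the pin once the reindex is verified (or open
                // the pinned commit instead, to roll back on error).
                snapshotter.release(pinned);
                writer.deleteUnusedFiles();
            }
        }
    }
}
```

The same pattern is what Lucene's replication and backup utilities build on: as long as a commit is snapshotted, its files survive subsequent commits and merges.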
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131951#comment-14131951 ] Tim Smith commented on LUCENE-5940:
---
I fully understand the reasons for wanting to change the policy here. I absolutely hate maintaining backwards compat myself; it is a nightmare that leaves lots of rotting code lying around waiting to wreak havoc, and it makes it dicey to add new functionality. I'm fully on board with that sentiment. But I have to support back compat, and do so in a seamless, online manner that is not prone to user error.

I also get the feeling that a lot of the Lucene devs in general don't think full reindexing is an issue and assume it can be done at any point with minimal cost (just a vibe I've picked up). My experience is that it can be a many-months-long process (slow sources). This seems to influence support for backwards compatibility, as well as support for changing configuration/schema options for existing fields, etc.

By all means, create a good upgrade tool people can use. However, it won't be useful for me, and I will need to find a different solution (which will likely slow my adoption of 5.0 when it is released).

I am in no way advocating that 5.0 should support reading 3.x indexes. Again, I'm just adding my perspective here so informed people can make a decision based on all points of view. If the policy changes, I will just have to adapt as necessary.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130121#comment-14130121 ] Tim Smith commented on LUCENE-5940:
---
I understand the desire to change the policy here. I wish I didn't have to care about backwards compat support, but it's just the nature of things: people have large indexes that can take a significant amount of time to reindex (due to a slow source or complex processing).

The current proposal would be problematic for any Lucene users who do not release versions in lock step with Lucene versions. Solr would have limited issues here, since a user could just upgrade to Solr 4.99 (assuming 4.99 is the final 4.x version) and then to Solr 5.0, with no problems. However, if product X shipped with Lucene 4.88 and the last minor version in the 4.x line was 4.99, then the upgrade process to get to a Lucene 5.0 index becomes convoluted and requires the creation of custom offline tools to provide an upgrade path. The backwards compatibility requirement is then just shifted from the Lucene devs to the Lucene users, and the transition can no longer be seamless.

The current policy does not have these issues: all I need to do is fire up the next version, do a forceMerge, and everything is up to date on the latest codecs (no offline processes required, and search continues to work during the upgrade).
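The in-place upgrade path described above (fire up the next version, call forceMerge) amounts to the following. A minimal sketch against the 5.x-style API; the index path comes from the command line, and it assumes the new version still carries read support for the old segment formats.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class InPlaceUpgrade {
    public static void main(String[] args) throws Exception {
        // Open the existing (older-format) index with the new Lucene
        // version; the read-only codecs let the writer read old segments.
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Merging everything into one segment rewrites all segments
            // in the current index format; a DirectoryReader can keep
            // serving searches while this runs.
            writer.forceMerge(1);
            writer.commit();
        }
    }
}
```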
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130138#comment-14130138 ] Tim Smith commented on LUCENE-5940:
---
5.0 should not be saddled with supporting 3.x indexes; 100% agree there. However, 5.0 should ideally continue to support 4.0-4.99 indexes (at least from the codec/index-reading perspective).

The best place to handle backwards compat is in the core of Lucene. Otherwise, you are just going to have users all over the place doing their own interpretations of backwards compat, getting it wrong or broken, etc., which will subsequently result in lots of irate users filing tickets. If you only support the last minor version from the previous release, it makes things difficult for everyone who was not at that exact minor release.

Also, to Uwe's point, the IndexUpgrader tool is an offline process. In my situation, I would also need custom packaging of that tool to provide ease of use, proper codec usage, etc., versus just firing up the index on 5.0 and calling forceMerge. The custom packaging would also require including an old version of Lucene in my project, packaged separately, which would be a nightmare to maintain. Alternatively, I would just grab the source for all the removed 4.x codecs I need and pull them into my project (not ideal, since they would no longer be maintained by the Lucene devs and may have dependency issues that would require porting).
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130149#comment-14130149 ] Tim Smith commented on LUCENE-5940:
---
Time-based would be much more reasonable. As long as people are on a 4.x release that is less than 1-2 years old, they should be able to move directly to 5.0. Supporting indexes 4+ years old is asking a bit much, but assuming an external release cycle of one year, a 1-2 year cutoff is manageable.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130151#comment-14130151 ] Tim Smith commented on LUCENE-5940:
---
Firefox does not need to worry about an upgrade path for terabytes' worth of data; it only needs to worry about upgrading bookmarks, and that's about it.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130169#comment-14130169 ] Tim Smith commented on LUCENE-5940:
---
I fully understand the pain associated with maintaining back compat. I guess it would be good if you (and others) could enumerate all the issues involved, for full perspective (the description does not list them). Also, it should be on the developer who removes write support (or removes a codec) to add the backwards compat support/testing. Creating a new codec that supplants an old codec should not inherently require removing write support for the old codec.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130218#comment-14130218 ] Tim Smith commented on LUCENE-5940:
---
The problem with the upgrade-tool approach is that it doesn't scale to clusters with large numbers of indexes: for instance, a cluster with 50 indexes spread across a bunch of machines. The upgrade is now an involved manual task put in the hands of system administrators who don't really know what's going on under the hood. That's just asking for trouble.

It seems like the whole power of codecs is that you can avoid all this and allow seamless transitions by having read-only codecs for previous index formats. Are there technical issues here I'm unaware of, beyond creating and maintaining the backwards compat tests? Something outside the codec mechanism that causes problems? If not, just put the read-only codecs for old versions in a contrib module and let people upgrade at their leisure (and let the community find and fix bugs as they are encountered).
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130307#comment-14130307 ] Tim Smith commented on LUCENE-5940:
---
I would not consider old indexes lacking support for new features an issue. If you want to use new options/features/structures, you need to reindex; no problem there.

You don't have to convince me that supporting back compat sucks. I agree, but Lucene is used by a lot of people for a lot of disparate use cases. Removing support for back compat will drive people away, since it removes seamless upgrade paths. Think what would have happened if Microsoft had released 64-bit Windows with no support for running old 32-bit programs. People still want to run old DOS programs on Windows (go figure, but they want/need it). Dropping back compat hurts adoption of new versions; it just leaves a bunch of people running ancient versions of Lucene because they have no good upgrade path other than complete reindexing. If there is a bug in feature X, a possible solution is to just remove feature X, but that is going to anger everyone who relies on it, regardless of how much you may personally hate feature X.

The main challenge I see in what you mention is that you want (or new features may require) refactoring of the codec API. This is an engineering challenge, and it would just require some thought-out design to decide what final API refactors are needed to support flexibility, addition of new features, and growth, without requiring mucking with old codecs in the future. Right now, the IndexWriter and the codecs are pretty muddled together in some cases. Cleaning up these interfaces and making the codecs self-contained should be a goal for any refactors, to allow future innovation and addition of features.

As a Lucene user, if back compat is yanked and not provided in 5.0 for all 4.x indexes, I will be extremely resistant to upgrading. I would be more inclined to fork the latest 4.x and ditch 5.0; 5.0 would have to offer something REALLY compelling to get me to adopt it.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130314#comment-14130314 ] Tim Smith commented on LUCENE-5940:
---
bq. Can you elaborate more? Your example of 50 indexes spread across many machines doesn't make me understand how it would be difficult to run this tool. I see the steps as:

Here are the issues I would have with an upgrade-tool approach:
1. External network connectivity is not guaranteed.
2. I have special metadata written in the segment metadata that is important.
3. I use custom codec configuration that the upgrade tool would need to use.
4. Replicated indexes need a lot of care.
5. The tool would need to be run once for each directory containing an index, on every node that contains indexes. This is an ops nightmare, since I won't personally be running the tool; it leaves lots of room for user error that is avoided completely if the index upgrade is seamless (via read-only codecs for old versions).
6. Custom Directory implementations may muck up the works.

In general, I don't see any way this upgrade tool would be useful to me without repackaging it and adding a ton of extra code to do all the things I need to ensure a consistent index is emitted.
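The per-directory fan-out in point 5 is the crux: some wrapper like the following would have to be maintained and run by operations on every node. This is a hypothetical dry-run sketch, not a real tool; it only detects index directories by the presence of a segments.gen file (a heuristic) and prints the IndexUpgrader command lines rather than executing them. The jar name is illustrative, and a real run would also have to check that no writer holds each index lock.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Emit the IndexUpgrader invocation needed for every index directory
// found under a data root. A real script would execute each command
// instead of just printing it.
public class PrintUpgradeCommands {
    static List<String> upgradeCommands(Path dataRoot) throws IOException {
        List<String> cmds = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(dataRoot)) {
            paths.filter(p -> p.getFileName().toString().equals("segments.gen"))
                 .forEach(p -> cmds.add(
                     "java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader"
                     + " -delete-prior-commits " + p.getParent()));
        }
        return cmds;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PrintUpgradeCommands <dataRoot>");
            return;
        }
        for (String cmd : upgradeCommands(Paths.get(args[0]))) {
            System.out.println(cmd);
        }
    }
}
```

Even in this trivial form, the script must be distributed to every node, kept in sync with the deployed Lucene version, and run by someone who understands what a failure half-way through means, which is exactly the user-error surface the comment is describing.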
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130324#comment-14130324 ] Tim Smith commented on LUCENE-5940:
---
bq. Because you are not even considering the developer pain. The tests man, maintaining the tests.

The pain will continue to exist; you are just shifting who feels it. Again, I get how painful it is, but it is best to have that pain felt at the source (and handled properly and consistently by people who fully understand it), as opposed to pushing it all downstream and polluting the waters.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130333#comment-14130333 ] Tim Smith commented on LUCENE-5940:
---
bq. I don't care what happens on this issue, personally. I'm done working on back compat completely until the policy changes. That includes the current in-progress 4.10.1 release. I've done more than my fair share of fighting it, and it just causes me endless frustration.

That is fully your prerogative; this is a volunteer community. I'm just putting in my two cents here, since a change will be really painful for me personally. Of course, I'm not a committer, so I have no final say.
[jira] [Commented] (LUCENE-5569) Rename AtomicReader to LeafReader
[ https://issues.apache.org/jira/browse/LUCENE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959096#comment-13959096 ] Tim Smith commented on LUCENE-5569:
---
-1. Please don't do this.

Renaming things for the sake of renaming them is a horrible burden on people using these APIs. For instance, every single minor version of Lucene 4.x has broken API signatures, resulting in hours or days of time spent reconciling the changes. Adding a major name change like this introduces significant noise on top of fixing any real compile errors and significantly complicates the porting process (it took me weeks to upgrade from Lucene 3.x to 4.x; I don't want to do that again). AtomicReader is a public API in Lucene and should not be renamed just because a new name seems better.

Rename AtomicReader to LeafReader

Key: LUCENE-5569
URL: https://issues.apache.org/jira/browse/LUCENE-5569
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Priority: Minor
Fix For: 5.0

See LUCENE-5527 for more context: several of us seem to prefer {{Leaf}} to {{Atomic}}. Talking from my experience, I was a bit confused in the beginning that this thing is named {{AtomicReader}}, since {{Atomic}} is otherwise used in Java in the context of concurrency. So maybe renaming it to {{Leaf}} would help remove this confusion and also carry the information that these readers are used as leaves of top-level readers?

--
This message was sent by Atlassian JIRA (v6.2#6252)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923910#comment-13923910 ] Tim Smith commented on LUCENE-5492:
---
Here's what my test is doing:
1. Unpack a Lucene 3.x-era index (it has one segment in it).
2. Open an IndexWriter on the 3.x index.
3. Open a DirectoryReader using the IndexWriter.
4. Add one new document.
5. Commit the IndexWriter.
6. Reopen the DirectoryReader using the IndexWriter.
7. Optimize the IndexWriter.
8. Commit the optimized index.
9. Reopen the DirectoryReader using the IndexWriter.

One thing of note is that I have a custom IndexDeletionPolicy. The policy holds onto named commit points: I hold onto the previous commit point at commit time, and then release it shortly after the commit is finished, once I have persisted my acceptance of the new commit point (calling deleteUnusedFiles() to purge it).

IndexFileDeleter AssertionError in presence of *_upgraded.si files

Key: LUCENE-5492
URL: https://issues.apache.org/jira/browse/LUCENE-5492
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 4.7
Reporter: Tim Smith
Assignee: Michael McCandless

When calling IndexWriter.deleteUnusedFiles against an index that contains 3.x segments, I am seeing the following exception:
{code}
java.lang.AssertionError: failAndDumpStackJunitStatment: RefCount is 0 pre-decrement for file _0_upgraded.si
 at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:630)
 at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:514)
 at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:286)
 at org.apache.lucene.index.IndexFileDeleter.revisitPolicy(IndexFileDeleter.java:393)
 at org.apache.lucene.index.IndexWriter.deleteUnusedFiles(IndexWriter.java:4617)
{code}
I believe this is caused by IndexFileDeleter not being aware of the Lucene3x segment infos format (notably the *_upgraded.si files created when upgrading an old index). This is new in 4.7 and did not occur in 4.6.1. Still trying to track down a workaround/fix.
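The deletion policy described above might look roughly like this. This is a hypothetical sketch of the pattern, not the actual code from the report; the class and method names are invented, and only the shape of the Lucene IndexDeletionPolicy callbacks is assumed.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Holds the previous commit point across a commit, releasing it only
// after the new commit has been accepted/persisted externally.
public class HoldPreviousCommitPolicy extends IndexDeletionPolicy {
    private volatile IndexCommit held; // pinned previous commit, if any

    @Override
    public void onInit(List<? extends IndexCommit> commits) throws IOException {
        onCommit(commits);
    }

    @Override
    public void onCommit(List<? extends IndexCommit> commits) throws IOException {
        // Commits are ordered oldest to newest; the last one is current.
        IndexCommit latest = commits.get(commits.size() - 1);
        // Pin the commit that was current before this one.
        if (commits.size() > 1) {
            held = commits.get(commits.size() - 2);
        }
        IndexCommit pinned = held;
        for (IndexCommit c : commits) {
            // Keep the latest commit and the pinned one; drop the rest.
            if (c != latest && c != pinned) {
                c.delete();
            }
        }
    }

    // Called once the new commit point has been persisted/accepted;
    // the caller then invokes IndexWriter.deleteUnusedFiles() to purge
    // the previously pinned commit.
    public void releaseHeld() {
        held = null;
    }
}
```

The interaction the bug report describes is the deleteUnusedFiles() call made after releaseHeld(): that is when IndexFileDeleter revisits the policy and decrements refcounts for the no-longer-pinned commit's files, including the *_upgraded.si files.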
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922646#comment-13922646 ] Tim Smith commented on LUCENE-5492:
---
Narrowing it down: I am definitely seeing a reference-count issue, and it only seems to occur when using the DirectoryReader.open(IndexWriter ...) methods. For one particular commit point, segments_4, I see the following refcount behavior:
* incRef segments_4
** incRef _0_upgraded.si, refcount=3
** decRef _0_upgraded.si, refcount=2
* incRef segments_4
** NOTE: _0_upgraded.si is not incRef'd this time
* ...
* delete segments_4
** decRef _0_upgraded.si -> ERROR
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922847#comment-13922847 ] Tim Smith commented on LUCENE-5492: ---
That seems to be the culprit. In my IndexWriter subclass, I overrode incRefDeleter and decRefDeleter to be no-ops and it no longer fails horribly. Hopefully this doesn't have any negative effects (it looks like that was all that was in the patch on LUCENE-5434, so worst case I just don't get to take advantage of the benefits there).
[jira] [Created] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
Tim Smith created LUCENE-5492:
-
Summary: IndexFileDeleter AssertionError in presence of *_upgraded.si files
Key: LUCENE-5492
URL: https://issues.apache.org/jira/browse/LUCENE-5492
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 4.7
Reporter: Tim Smith

When calling IndexWriter.deleteUnusedFiles against an index that contains 3.x segments, I am seeing the following exception:
{code}
java.lang.AssertionError: failAndDumpStackJunitStatment: RefCount is 0 pre-decrement for file _0_upgraded.si
at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:630)
at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:514)
at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:286)
at org.apache.lucene.index.IndexFileDeleter.revisitPolicy(IndexFileDeleter.java:393)
at org.apache.lucene.index.IndexWriter.deleteUnusedFiles(IndexWriter.java:4617)
{code}
I believe this is caused by IndexFileDeleter not being aware of the Lucene3x SegmentInfos format (notably the _upgraded.si files created to upgrade an old index). This is new in 4.7 and did not occur in 4.6.1. Still trying to track down a workaround/fix.
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921398#comment-13921398 ] Tim Smith commented on LUCENE-5492: ---
To the best of my knowledge, I don't think it's something crazy or wrong I'm doing on my part. Still trying to get to the bottom of it. It seems to be related to the accounting of files in a SegmentInfos not behaving properly for legacy 3.x segments.
[jira] [Comment Edited] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921432#comment-13921432 ] Tim Smith edited comment on LUCENE-5492 at 3/5/14 9:18 PM: ---
The following FileNotFound exception is firing:
{code}
java.io.FileNotFoundException: target/data-16000/mockEngine/index/_0.si (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:382)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:127)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:175)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:773)
{code}
This results in IndexFileDeleter ignoring the segment (skipping incRef()), leaving refcounts of 0 for its files. Then the CommitPoint is deleted (which does reference the files properly) and the files are decRef'd, resulting in the exception.
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921432#comment-13921432 ] Tim Smith commented on LUCENE-5492: ---
The following FileNotFound exception is firing:
{code}
java.io.FileNotFoundException: /home/tsmith/src/attivio/app/target/data-16000/mockEngine/index/_0.si (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:382)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:127)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:175)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:773)
{code}
This results in IndexFileDeleter ignoring the segment (skipping incRef()), leaving refcounts of 0 for said files. Then the CommitPoint is deleted (which does reference the files properly) and the files are decRef'd, resulting in the exception.
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921452#comment-13921452 ] Tim Smith commented on LUCENE-5492: ---
FileNotFound was actually triggered later (as things were shutting down, after the initial assertion tripped). My current theory is that the .si and _upgraded.si files are not being registered with the IndexFileDeleter properly, or are somehow double decRef'd. I see the _upgraded.si and .si files get decRef'd and deleted, followed by another decRef, which trips the assert.
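The failure mode discussed in this thread can be modeled without Lucene: a per-file reference counter where one code path skips incRef while a later path still decRefs. This is a hypothetical sketch of the accounting pattern, not IndexFileDeleter's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-file reference counting, illustrating how a skipped
// incRef (e.g. because reading a commit's SegmentInfos failed) leads to a
// "RefCount is 0 pre-decrement" error when that commit is later deleted.
public class RefCountDemo {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void incRef(String file) {
        refCounts.merge(file, 1, Integer::sum);
    }

    void decRef(String file) {
        int count = refCounts.getOrDefault(file, 0);
        if (count <= 0) {
            // Mirrors the assertion in the reported stack trace.
            throw new AssertionError("RefCount is 0 pre-decrement for file " + file);
        }
        refCounts.put(file, count - 1);
    }

    public static void main(String[] args) {
        RefCountDemo deleter = new RefCountDemo();
        // One commit point registers and releases its file normally.
        deleter.incRef("_0_upgraded.si");
        deleter.decRef("_0_upgraded.si"); // count back to 0
        // A second commit point fails to read its SegmentInfos, so incRef
        // is skipped; deleting that commit still decRefs its files:
        try {
            deleter.decRef("_0_upgraded.si");
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether the real bug is a skipped incRef or a double decRef, the observable symptom is the same: a decrement arriving when the count is already zero.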
[jira] [Created] (LUCENE-4671) CharsRef.subSequence broken
Tim Smith created LUCENE-4671:
-
Summary: CharsRef.subSequence broken
Key: LUCENE-4671
URL: https://issues.apache.org/jira/browse/LUCENE-4671
Project: Lucene - Core
Issue Type: Bug
Reporter: Tim Smith

Looks like CharsRef.subSequence() is currently broken. It is implemented as:
{code}
@Override
public CharSequence subSequence(int start, int end) {
  // NOTE: must do a real check here to meet the specs of CharSequence
  if (start < 0 || end > length || start > end) {
    throw new IndexOutOfBoundsException();
  }
  return new CharsRef(chars, offset + start, offset + end);
}
{code}
Since the CharsRef constructor is (char[] chars, int offset, int length), it should be:
{code}
@Override
public CharSequence subSequence(int start, int end) {
  // NOTE: must do a real check here to meet the specs of CharSequence
  if (start < 0 || end > length || start > end) {
    throw new IndexOutOfBoundsException();
  }
  return new CharsRef(chars, offset + start, end - start);
}
{code}
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
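The reported bug can be reproduced with a minimal stand-in for CharsRef (SimpleCharsRef below is a hypothetical simplification, not the real Lucene class): passing offset + end where the constructor expects a length yields a slice of the wrong size.

```java
// Minimal stand-in to illustrate the bug: the constructor takes
// (chars, offset, length), but the broken subSequence passes an end position
// as the third argument.
class SimpleCharsRef {
    final char[] chars;
    final int offset;
    final int length;

    SimpleCharsRef(char[] chars, int offset, int length) {
        this.chars = chars;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public String toString() {
        return new String(chars, offset, length);
    }

    // Broken: third argument is offset + end, an end position, not a length.
    SimpleCharsRef subSequenceBroken(int start, int end) {
        return new SimpleCharsRef(chars, offset + start, offset + end);
    }

    // Fixed: the third argument is the length of the slice.
    SimpleCharsRef subSequenceFixed(int start, int end) {
        return new SimpleCharsRef(chars, offset + start, end - start);
    }
}

public class CharsRefBugDemo {
    public static void main(String[] args) {
        // A ref viewing "world" inside "hello world": offset=6, length=5.
        SimpleCharsRef ref = new SimpleCharsRef("hello world".toCharArray(), 6, 5);
        System.out.println(ref.subSequenceFixed(1, 3));          // or
        // The broken version builds a ref of length offset + end = 9 instead
        // of 2, reading past the intended window (or past the array itself).
        System.out.println(ref.subSequenceBroken(1, 3).length);  // 9
    }
}
```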
[jira] [Commented] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548841#comment-13548841 ] Tim Smith commented on LUCENE-4671: ---
Looks like the index-out-of-bounds check is a bit off too (if someone ever uses non-zero offsets). The check should probably be:
{code}
if (start < offset || end > (offset + length) || start > end) {
  throw new IndexOutOfBoundsException();
}
{code}
Assignee: Robert Muir
Attachments: LUCENE-4671.patch
[jira] [Issue Comment Deleted] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4671: --
Comment: was deleted
(was: looks like the index-out-of-bounds check is a bit off too (if someone ever uses non-zero offsets). The check should probably be:
{code}
if (start < offset || end > (offset + length) || start > end) {
  throw new IndexOutOfBoundsException();
}
{code})
[jira] [Commented] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548843#comment-13548843 ] Tim Smith commented on LUCENE-4671: ---
Looks good.
[jira] [Commented] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548848#comment-13548848 ] Tim Smith commented on LUCENE-4671: ---
It is; that's why I deleted the comment. It just looked wrong to me for a moment.
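Why the retracted check was off, and the original check right, can be verified with a quick sketch: per CharSequence.subSequence semantics, start and end are relative to the ref's own window, not absolute positions in the backing array. The helper methods below are illustrative only, not Lucene code:

```java
// Compare the two bounds checks from this thread for a ref viewing
// "world" inside "hello world" (offset=6, length=5). Because start/end are
// relative, subSequence(0, 5) selects the whole window and must be accepted.
public class BoundsCheckDemo {
    // Original check from CharsRef.subSequence: relative bounds.
    static boolean originalCheckRejects(int start, int end, int offset, int length) {
        return start < 0 || end > length || start > end;
    }

    // The offset-based check proposed (and then retracted) in this thread.
    static boolean proposedCheckRejects(int start, int end, int offset, int length) {
        return start < offset || end > (offset + length) || start > end;
    }

    public static void main(String[] args) {
        System.out.println(originalCheckRejects(0, 5, 6, 5)); // false: accepted
        System.out.println(proposedCheckRejects(0, 5, 6, 5)); // true: wrongly rejected
    }
}
```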
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537132#comment-13537132 ] Tim Smith commented on LUCENE-4560: ---
I found a 100% pure codec approach for providing all the functionality I require here and more, requiring no patches. If any committer has interest in pushing this ticket forward, I can clean up the patch/add suggestions, etc.; otherwise this ticket can be closed.

Support Filtering Segments During Merge
-
Key: LUCENE-4560
URL: https://issues.apache.org/jira/browse/LUCENE-4560
Project: Lucene - Core
Issue Type: Improvement
Reporter: Tim Smith
Attachments: LUCENE-4560.patch, LUCENE-4560-simple.patch

Spun off from LUCENE-4557. It is desirable to be able to filter segments during merge. Most often, a full reindex of content is not possible. Merging segments can sometimes have negative consequences when fields have different options (the most restrictive option is forced during merge). Being able to filter segments during merges will allow gradually migrating indexed data to new index settings, and support pruning/enhancing existing data gradually.
Use Cases:
* Migrate IndexOptions for fields (see LUCENE-4557)
* Gradually remove index fields no longer used
* Migrate indexed sort fields to DocValues
* Support converting data types for indexed data
* and so on
A patch will be forthcoming.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537223#comment-13537223 ] Tim Smith commented on LUCENE-4560: ---
The codec approach I'm taking is pretty specific, incorporating my schema/configuration to allow migrating/enhancing options/features/indexing formats/etc. (still exploring all the possibilities). There may be a few things that would reduce the overhead or ease the implementation; I will create new tickets with patches as I identify them. NOTE: the codec API is very nice. Congrats to all involved in making that happen.
[jira] [Commented] (LUCENE-4272) another idea for updatable fields
[ https://issues.apache.org/jira/browse/LUCENE-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537449#comment-13537449 ] Tim Smith commented on LUCENE-4272: ---
+1 on the term vector approach. I would like to see the following added to IndexableField:
{code}
/** Expert. index inverted terms for field */
public Terms invertedTerms();
{code}
This would allow partial updates via term vectors without having to flatten back into a TokenStream first. This would also facilitate things like the following:
* index a document into a memory index
* run alert queries/per-doc analysis against the memory index
* get terms from the memory index for all fields and index them into the on-disk index using IndexableField.invertedTerms()
* double tokenization/analysis/inversion is now avoided

another idea for updatable fields
-
Key: LUCENE-4272
URL: https://issues.apache.org/jira/browse/LUCENE-4272
Project: Lucene - Core
Issue Type: New Feature
Reporter: Robert Muir

I've been reviewing the ideas for updatable fields and have an alternative proposal that I think would address my biggest concern: not slowing down searching. When I look at what Solr and Elasticsearch do here, by basically reindexing from stored fields, I think they solve a lot of the problem: users don't have to rebuild their document from scratch just to update one tiny piece. But I think we can do this more efficiently: by avoiding reindexing of the unaffected fields. The basic idea is that we would require term vectors for this approach (as they already store a serialized indexed version of the doc), and so we could just take the other pieces from the existing vectors for the doc. I think we would have to extend vectors to also store the norm (so we don't recompute that), and payloads, but it seems feasible at a glance. I don't think we should discard the idea because vectors are slow/big today; this seems like something we could fix. Personally I like the idea of not slowing down search performance to solve the problem; I think we should really start from that angle and work towards making the indexing side more efficient, not vice-versa.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500226#comment-13500226 ] Tim Smith commented on LUCENE-4560: ---
The gradual approach is very much required. It's possible that a config change by a user will result in the need for a filtered reader on a merge. For instance, if you index a field without offsets, then shut down and start up with indexing of offsets: currently, this situation will result in newly indexed offsets being obliterated on merge (LUCENE-4557) with no possible way to save them. Especially in this case, the addIndexes() approach is way too costly just for a small configuration change. Small config changes shouldn't require the equivalent of a full optimize to take effect. Also, I argue that any addIndexes() approach is even more dangerous and just as prone to corruption. It can perform the same filtering of readers as the attached patch provides, but it modifies the entire index, thereby causing any corruption to be much more widespread. (Of course, either way it is up to the person implementing their custom filter to guarantee that no corruption occurs and that their code produces consistent indexes.) I will look into the MergePolicy approach. Offhand, it looks like this may still require a patch, as the SegmentMerger is currently only aware of SegmentReaders for merging; however, I may be able to add my own SegmentInfos to the OneMerge, replacing the codec with a wrapped codec that will apply my filtering. It'll be about a week before I can get back to testing this; I'll report back then.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500275#comment-13500275 ] Tim Smith commented on LUCENE-4560: --- Offsets can be used for highlighting. Users want to configure highlighting per field, but they don't always know which fields they want to highlight and may change this setting frequently. Setting highlighting=true on a field should be fully possible without requiring a full reindex (old documents of course will not be highlighted, or may default to a slower highlighting method that does not require offsets). Slowly refeeding old documents will let users get full functionality for old docs as well; however, refeeding may take weeks and should not impact indexing of new content. I can't proactively enable offsets on the off chance highlighting will be enabled in the future, as this implies additional disk requirements. This is the primary use case that spawned this ticket: right now, due to the merging behavior, I cannot use indexed offsets for highlighting, since a setting change will result in merges destroying offsets. This filtering merge reader approach also fulfills other requirements I have for migrating old indexed content to use new features, so it would be a win-win for me to use it to ensure consistency and conformance with my schema.
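The per-document decision described above (pick a highlighting strategy based on what the document's segment actually indexed) can be sketched without Lucene dependencies. This is an illustrative model only; `SegmentInfo` and `strategyFor` below are stand-ins, not Lucene's API:

```java
import java.util.Collections;
import java.util.Set;

public class HighlightStrategyDemo {
    // Illustrative stand-in: a segment records which fields were
    // indexed with offsets (real Lucene exposes this via FieldInfos).
    static final class SegmentInfo {
        final String name;
        final Set<String> fieldsWithOffsets;
        SegmentInfo(String name, Set<String> fieldsWithOffsets) {
            this.name = name;
            this.fieldsWithOffsets = fieldsWithOffsets;
        }
    }

    // Choose a highlighting strategy for one field of one document,
    // based on the segment the document lives in.
    static String strategyFor(SegmentInfo segment, String field) {
        return segment.fieldsWithOffsets.contains(field)
                ? "offsets"        // fast, offsets-based highlighting
                : "tokenstream";   // slower fallback: re-analyze stored text
    }

    public static void main(String[] args) {
        SegmentInfo oldSeg = new SegmentInfo("_0", Collections.emptySet());
        SegmentInfo newSeg = new SegmentInfo("_1", Collections.singleton("body"));
        System.out.println(strategyFor(oldSeg, "body")); // tokenstream
        System.out.println(strategyFor(newSeg, "body")); // offsets
    }
}
```

The key point is that the choice is per segment, not global, which is exactly what an unfiltered merge destroys.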
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500291#comment-13500291 ] Tim Smith commented on LUCENE-4560: --- bq. Its been this way since even 2.x, if you omitTF, then later decide you want TF and positions, you need to re-index. Re-index is the key word here. Re-indexing is not something that can always be done, and it implies a massive cost. Changing a schema setting for one field should not require a full re-index. I'm afraid I'm in a world where re-index is a four-letter word and should only be done in the most extreme of circumstances. My whole point here is that migration should be possible via a pluggable policy. bq. Thats why these are expert options. I know these are expert options, but there should also be a means to support migration to new settings (albeit an expert means to do so, one that may have some consequences for how old documents were indexed).
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500318#comment-13500318 ] Tim Smith commented on LUCENE-4560: --- A migration strategy does exist and is very simple. It is up to the implementer to determine how data will be migrated and to properly communicate that to the user base so expectations are set properly. All migrations have pros and cons, and may require gradual reindexing of content to ensure consistency for old documents, but this is up to the implementer and shouldn't be imposed by the Lucene APIs. Let's analyze the highlighting case based on indexed offsets. Assume documents were indexed with no offsets, and highlighting was being done for these documents using tokenstream-based highlighting over the stored field text. Now the user switches to more efficient offsets-based highlighting, and new documents are indexed with offsets. Right now, assuming no merging was done, it is very easy to see whether a document has indexed offsets, and documents can be highlighted on a per-document basis according to what was indexed. Then a merge happens. (Currently, this will force tokenstream-based highlighting for all documents, undoing the configuration setting.) If a migration policy is applied, old documents can have (0,0) offsets applied (this is the decision of the migration policy and is up to its implementer). Now, when highlighting is applied, if all positions for a document have a (0,0) offset, it can fall back to tokenstream-based highlighting; if positions have real offsets, it will use them to perform optimal, full-featured highlighting. This will result in slightly slower highlighting for old documents. The user experience can then be improved by doing a gradual reindex of old documents, without requiring the user to blast away their existing index.
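The (0,0)-placeholder fallback described above can be sketched in plain Java. This is an illustrative model of the decision logic, not Lucene's highlighter API; `Offset` and `useOffsetHighlighting` are names invented for the sketch:

```java
import java.util.Arrays;
import java.util.List;

public class OffsetFallbackDemo {
    // One (startOffset, endOffset) pair per indexed position.
    record Offset(int start, int end) {}

    // If every position carries the (0,0) placeholder written by the
    // migration policy, the document predates offsets: fall back to
    // tokenstream-based highlighting. Otherwise use the real offsets.
    static boolean useOffsetHighlighting(List<Offset> positions) {
        return positions.stream().anyMatch(o -> o.start() != 0 || o.end() != 0);
    }

    public static void main(String[] args) {
        List<Offset> migratedDoc = Arrays.asList(new Offset(0, 0), new Offset(0, 0));
        List<Offset> newDoc = Arrays.asList(new Offset(0, 5), new Offset(6, 11));
        System.out.println(useOffsetHighlighting(migratedDoc)); // false
        System.out.println(useOffsetHighlighting(newDoc));      // true
    }
}
```

Using (-1,-1) as the placeholder, as suggested later in this thread, would avoid any ambiguity with a legitimate zero-length token at offset 0.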
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500322#comment-13500322 ] Tim Smith commented on LUCENE-4560: --- Uwe, I plan to investigate your suggestions, and it may turn out that no additional patching of Lucene is required. It'll be about a week before I can get to that; I will post my results then. I still don't see the addIndexes() approach as viable, even as you suggest it, since it requires up-front migration steps instead of gradual migration. The merge policy approach you suggested will likely be more useful to me; however, it will be a nasty merge policy.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1346#comment-1346 ] Tim Smith commented on LUCENE-4560: --- My base requirement here is that this be an online process. As such, the addIndexes approach is really not useful as I see it, especially as it requires 2x disk space as well as completely new index directories; it does not play well with upgrading a user's existing index. What I see as needed is the ability to gradually migrate indexes such that any individual segment is itself consistent. Currently, merging of indexes can result in loss of indexed data or otherwise break consistency, as in LUCENE-4557. It is 100% OK if all segments have not been processed, as I can identify each segment's settings at index open/search time and optionally filter/search/read segments differently. It is true that once you start using this SegmentMergeFilter, you pretty much have to keep using it forever. I don't see this as an issue: when supporting old indexes, you constantly have to support migration of data that was indexed using old code. For instance, as time goes on, my MergeSegmentFilter will do more, supporting migration of more and more old index formats/config settings to the latest indexing format/settings. At quick glance, FilteringCodec looks like it applies to writing new content, not reading existing indexes? It doesn't seem like that would do the trick here. I would need some way to have the index writer wrap the codec for existing segments in order to inject my custom filtering during merging. That would be logically identical to the patch provided, but would potentially result in a much more complex patch.
[jira] [Updated] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4560: -- Lucene Fields: New,Patch Available (was: New)
[jira] [Created] (LUCENE-4560) Support Filtering Segments During Merge
Tim Smith created LUCENE-4560: - Summary: Support Filtering Segments During Merge Key: LUCENE-4560 URL: https://issues.apache.org/jira/browse/LUCENE-4560 Project: Lucene - Core Issue Type: Improvement Reporter: Tim Smith
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498079#comment-13498079 ] Tim Smith commented on LUCENE-4557: --- Spun off LUCENE-4560 for supporting filtering during segment merging. A patch will be forthcoming shortly. As long as that gains traction and makes it in, I will be happy (it will actually fulfill numerous other use cases I have). I still consider this issue a bug, given that indexed content is lost, and would recommend against closing this ticket; however, LUCENE-4560 will provide a more than adequate solution for my needs. Indexed Offsets Can Be Lost During Merge Key: LUCENE-4557 URL: https://issues.apache.org/jira/browse/LUCENE-4557 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0 Reporter: Tim Smith Attachments: OffsetsTest.java Primary use case: Start with a pre-4.0 index (no indexed offsets available). Start indexing new documents with indexed offsets (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, previously IndexOptions.DOCS_AND_FREQS_AND_POSITIONS). Merge/optimize the index. Newly indexed documents will now no longer have offsets available. In general, it is impossible to ever change a field to have offsets indexed when starting with an existing index, as a merge will cause offsets to be removed from the index. Desirable behavior would be for new documents to have offsets indexed properly, and old documents to have an offset of (0, 0) for all positions after merging with a segment that contains offsets. Current behavior can be very dangerous. For example: * Start indexing documents with indexed offsets * Change the config to not index offsets by accident * Index 1 document * Revert the config * Offsets will start disappearing from documents as segments are merged
[jira] [Updated] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4560: -- Attachment: LUCENE-4560.patch Attaching patch. The patch adds a MergeSegmentFilter base class and adds a config setter akin to IndexReaderWarmer on IndexWriterConfig (by all means, suggest better names). SegmentMerger will use this (if specified) to filter any segments being merged. A test case is included that uses the filter to remove an indexed field during merge.
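The shape of the proposed hook can be modeled without Lucene. The names `MergedSegmentFilter` and `dropField` below mirror the patch description and its test case, but the simplified `SegmentView` type and all signatures are stand-ins invented for this sketch, not the patch's actual API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeFilterDemo {
    // Stand-in for a segment: just field name -> indexed terms.
    interface SegmentView {
        Map<String, List<String>> fields();
    }

    // The proposed hook: wrap/transform each segment as it is merged.
    interface MergedSegmentFilter {
        SegmentView filter(SegmentView segment);
    }

    // A filter that drops a field no longer in the schema, analogous
    // to the patch's test case of removing an indexed field on merge.
    static MergedSegmentFilter dropField(String name) {
        return segment -> () -> {
            Map<String, List<String>> copy = new HashMap<>(segment.fields());
            copy.remove(name);
            return copy;
        };
    }

    public static void main(String[] args) {
        SegmentView seg = () -> Map.of(
                "title", List.of("lucene"),
                "obsolete", List.of("junk"));
        SegmentView filtered = dropField("obsolete").filter(seg);
        System.out.println(filtered.fields().containsKey("obsolete")); // false
    }
}
```

Because the filter only sees segments as they come up for merging, old data is rewritten gradually, which is the whole point of the ticket.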
[jira] [Comment Edited] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498087#comment-13498087 ] Tim Smith edited comment on LUCENE-4560 at 11/15/12 4:03 PM: - Attaching patch. The patch adds a MergedSegmentFilter base class and adds a config setter akin to IndexReaderWarmer on IndexWriterConfig (by all means, suggest better names). SegmentMerger will use this (if specified) to filter any segments being merged. A test case is included that uses the filter to remove an indexed field during merge. was (Author: tsmith): Attaching patch. The patch adds a MergeSegmentFilter base class and adds a config setter akin to IndexReaderWarmer on IndexWriterConfig (by all means, suggest better names). SegmentMerger will use this (if specified) to filter any segments being merged. A test case is included that uses the filter to remove an indexed field during merge.
[jira] [Created] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
Tim Smith created LUCENE-4557: - Summary: Indexed Offsets Can Be Lost During Merge Key: LUCENE-4557 URL: https://issues.apache.org/jira/browse/LUCENE-4557 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0 Reporter: Tim Smith
[jira] [Updated] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4557: -- Attachment: OffsetsTest.java Attaching test that shows the issue
[jira] [Reopened] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith reopened LUCENE-4557: --- I disagree with that assessment. The problem here is not that offsets are not available on old docs; the problem is that offsets are destroyed on documents that had them set properly. This is very much a bug: a small temporary config mistake by a user can cause destruction of indexed data during merging, even after it is corrected. As far as I'm concerned, this issue makes it unfeasible to ever use indexed offsets, even though I very much want to. Reindexing data is quite often out of the question when large indexes are involved.
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497208#comment-13497208 ] Tim Smith commented on LUCENE-4557: --- I understand the similarity to the omitTF case, but would argue that too is a bug. The main issue here is with merging. Merging currently seems to choose the most restrictive IndexOptions for a field instead of the most general. When you are writing new segments and provide contradictory IndexOptions for the same field, it is OK for the writer to produce new segments with the most restrictive set (or throw an exception at that point); I have no argument there. However, when existing segments are merged, no indexed data should be lost (as it is in this case). If you have 2 segments with the following: Segment 1: docs and freqs and positions Segment 2: docs and freqs and positions and offsets then the merged segment should have the following: Merged: docs and freqs and positions and offsets The offsets for docs that were part of segment 1 should be null/(start=0, end=0), or better yet (-1, -1) if possible. The offsets for docs that were part of segment 2 should be the proper offsets that were indexed for segment 2 in the first place. The same rule could also be applied to the omitTF case: Segment 1: docs only Segment 2: docs and freqs and positions Merged: docs and freqs and positions Docs from segment 1 should have frequency 1 and a single position of 0.
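The "most general wins" rule described above reduces to taking a maximum over an ordered enum. The enum below is a stand-in that mirrors the ordering of Lucene 4.x's IndexOptions constants for illustration; `mostGeneral` is a hypothetical name, not a Lucene method:

```java
import java.util.List;

public class MergedIndexOptionsDemo {
    // Stand-in mirroring Lucene 4.x IndexOptions, declared from most
    // restrictive to most general, so ordinal order is "generality" order.
    enum IndexOptions {
        DOCS_ONLY,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    // The proposed rule: the merged segment keeps the MOST general
    // options seen across the input segments (the behavior complained
    // about keeps the most restrictive, discarding indexed data).
    static IndexOptions mostGeneral(List<IndexOptions> perSegment) {
        IndexOptions result = IndexOptions.DOCS_ONLY;
        for (IndexOptions o : perSegment) {
            if (o.compareTo(result) > 0) result = o;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mostGeneral(List.of(
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS,
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)));
        // DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }
}
```

Documents from less general segments would then need synthesized values (freq 1, position 0, offsets (0,0) or (-1,-1)) for the dimensions they never indexed, as the comment proposes.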
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497278#comment-13497278 ] Tim Smith commented on LUCENE-4557: --- I understand your aversion to what I suggest; however, I still argue this is a pretty nasty bug given that indexed content is lost. I also argue that it should be fully supported to change settings on fields as time goes on, especially the ability to make a field more general (add positions/offsets/insert-new-feature-here). Old data would of course be limited to the settings it was indexed with; however, new content should not be restricted to old settings. Without supporting this, you are forcing full reindexes in situations that really should not require them. This is a big red flag in my opinion. From what I understand of your FilterReader suggestion, it would require me to do the equivalent of an index optimize in order to upgrade/convert the index to have (0,0) offsets on segments that were lacking this setting? This seems extremely expensive: I would have to detect this situation at index startup time and then spend very large amounts of time performing the conversion, all while blocking indexing from continuing until the operation is over. Controlling this behavior at merge time seems to be the appropriate place. As long as I could control the merge behavior via a pluggable/configurable API I would be happy, and any other users who encounter this issue would also have a means to address it. It looks like merging of segment data is not exposed at all, so right now there is no way to handle this situation properly. For instance, if I could wrap the SegmentReader at merge time to provide null offsets, that would be fine. Ideally, there would still be some means to support efficient bulk merging of stored fields/term vectors/etc.
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497354#comment-13497354 ] Tim Smith commented on LUCENE-4557: --- I know you aren't changing your mind. I also disagree with calling this fake data; the data would be 100% representative of what was indexed. What I would at least like to see is a reasonable means to support this functionality. I propose some means to support more pluggable segment merging. For instance, if IndexWriter had the following method:

{code}
public AtomicReader getSegmentForMerge(SegmentReader reader) {
  return reader; // default implementation does nothing
}
{code}

then I could override this method, wrap the reader, and enhance its indexed content as it is merged in order to fulfill my requirements. This would have additional benefits, including but not limited to:
* Supporting migration of IndexOptions on fields
* Supporting migration of sort fields from indexed fields to DocValues
* Supporting conversion of data types for DocValues
* and so on

The wrapping would just need to be smart (a good MergeSegmentReader base class that SegmentMerger is integrated with) in order to preserve optimized bulk merges of stored fields/term vectors/etc. If this is a more palatable approach for you, I can work up a patch as I find time.
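The merge-time wrapping idea discussed here can be illustrated with a small decorator in plain Java. This is only a sketch of the pattern, not Lucene's API: the Postings interface and ZeroOffsetPostings class below are hypothetical stand-ins for a reader wrapper that substitutes (0, 0) offsets on segments that never indexed them.

```java
// A minimal stand-in for per-position postings data.
interface Postings {
    int position(int i);
    int startOffset(int i); // -1 means "offsets were not indexed"
    int count();
}

// Decorator that fills in a 0 offset for segments that lack offsets,
// so merged output is uniform even when old segments predate the setting.
class ZeroOffsetPostings implements Postings {
    private final Postings in;

    ZeroOffsetPostings(Postings in) { this.in = in; }

    public int position(int i) { return in.position(i); }
    public int count()         { return in.count(); }

    public int startOffset(int i) {
        int off = in.startOffset(i);
        return off < 0 ? 0 : off; // substitute 0 for missing offsets
    }
}
```

At merge time, a hook like the proposed getSegmentForMerge() would wrap old segments in such a decorator while passing new segments through untouched.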
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497459#comment-13497459 ] Tim Smith commented on LUCENE-4557: --- getSegmentForMerge could of course take an AtomicReader to support addIndexes as well. CheckIndex validates indexed positions/offsets against term vectors? Isn't that really slow? Also, if term vectors were indexed with offsets and the positions did not have offsets, and offsets are being added to positions as part of the merge, I could easily have my merge reader fill in the indexed position offsets from the term vectors. Of course this would be a slower merge, but it would then have 100% the right data and not result in the corruption you allude to. It would also make term vectors consistent and suitable for bulk merge. (Right now I don't have a use case that would have offsets indexed for both term vectors and positions (it'd be one or the other), but it's helpful that you pointed this issue out so I can make sure it would be handled properly in the future.) How about I look at working up a patch going down the pluggable segment data merging route, and we can iterate from there?
[jira] [Created] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
Tim Smith created LUCENE-4398: - Summary: Memory Leak in TermsHashPerField memory tracking Key: LUCENE-4398 URL: https://issues.apache.org/jira/browse/LUCENE-4398 Project: Lucene - Core Issue Type: Bug Affects Versions: 3.4 Reporter: Tim Smith

I am witnessing an apparent leak in the memory tracking used to determine when a flush is necessary. Over time, this will result in every single document being flushed into its own segment, as memUsage will remain above the configured buffer size, causing a flush to be triggered after every add/update. Best I can figure, this is being caused by TermsHashPerField's tracking of memory usage for postingsHash and/or postingsArray, combined with multi-threaded feeding. I suspect that a TermsHashPerField's postingsHash grows in one thread; then, when a segment is flushed, a single, different thread merges all TermsHashPerFields in FreqProxTermsWriter and calls shrinkHash(). I suspect this call of shrinkHash() is seeing an old postingsHash array and subsequently not releasing all the memory that was allocated. If this is the case, I am also concerned that FreqProxTermsWriter will not write the correct terms into the index, although I have not confirmed that any indexing problem occurs as of yet. NOTE: I am witnessing this growth in a test by subtracting the amount of memory allocated (but in a free state) by perDocAllocator/byteBlockAllocator/charBlocks/intBlocks from DocumentsWriter.memUsage.get() in IndexWriter.doAfterFlush(). I will see this stay at a stable point for a while; then on some flushes I will see it grow by a couple of bytes, and all subsequent flushes never go back down to the previous state. I will continue to investigate and post any additional findings.
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457151#comment-13457151 ] Tim Smith commented on LUCENE-4398: --- More information: I started tracking the memory usage internally in TermsHashPerField, using an internal AtomicLong that held the amount of memory retained by the class. I then added a finalize() method that dumped the memory held to stdout. Result: as soon as I witnessed the memory grow, I forced garbage collection (via the YourKit profiler). The finalize methods were then called, and the memory held by all garbage-collected TermsHashPerField instances equaled the amount of memory that was leaked. It looks like the DocumentsWriter is releasing thread states without freeing their bytesUsed(). NOTE: this puts my concerns about thread safety/improper indexing to rest.
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457196#comment-13457196 ] Tim Smith commented on LUCENE-4398: --- Looks like the culprit is DocFieldProcessorPerThread.trimFields(). This method releases fields that were not seen recently, and for each such field it leaks 16 bytes from DocumentsWriter.bytesUsed's memory accounting.
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457214#comment-13457214 ] Tim Smith commented on LUCENE-4398: --- Found an easy fix for this: commenting out the bytesUsed(postingsHashSize * RamUsageEstimator.NUM_BYTES_INT) line in TermsHashPerField's constructor does the trick. This results in not accounting for 16 bytes per field per thread, the same 16 bytes that were not being reclaimed by trimFields(). I suppose a more robust fix would be to add a destroy() method to the PerField interfaces that releases this memory (however, that would be a rather large patch). Also found a relatively easy way to reproduce this:
* Feed N documents with fields A-M
* Force flush
* Feed N documents with fields N-Z
* Force flush
* Repeat

It will take a long time to actually consume all the memory (using more fields in the test should accelerate things).
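The leak mechanism described in this thread can be simulated in a few lines of self-contained Java. Everything here (AccountingSim, PER_FIELD_BYTES, seeField, trimFields) is a hypothetical stand-in for the real classes; the point is just that bytes charged at per-field construction are never refunded when trimFields() discards the field, so the shared counter only ever ratchets upward.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Simulates the accounting bug: each per-field writer charges 16 bytes
// to a shared counter when created, but trimFields() drops the field
// object without refunding those bytes.
class AccountingSim {
    static final int PER_FIELD_BYTES = 16;
    final AtomicLong bytesUsed = new AtomicLong();
    final Map<String, Object> fields = new HashMap<>();

    void seeField(String name) {
        fields.computeIfAbsent(name, n -> {
            bytesUsed.addAndGet(PER_FIELD_BYTES); // charged at construction
            return new Object();
        });
    }

    // Buggy: releases the fields but never subtracts PER_FIELD_BYTES.
    void trimFields() { fields.clear(); }

    long leakedBytes() { return bytesUsed.get(); } // never returns to 0
}
```

Running the A-M / N-Z / repeat pattern from the repro steps against this simulation leaks 16 bytes per distinct field per cycle, exactly the growth pattern reported.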
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457244#comment-13457244 ] Tim Smith commented on LUCENE-4398: --- NOTE: the 16 bytes of unaccounted space for the postingsHash is actually much less than the object header and fields of a TermsHashPerField require, so I would argue that not accounting for these 16 bytes is a valid low-profile fix. The only gotcha would be if trimFields() is ever called on a TermsHashPerField that has not been shrunk down to size by a flush. Is that possible? Even if it is, I expect it only occurs in the rare case of deep-down exceptions. In that case, if abort() is called, I suppose the abort() method can be updated to shrink down the hash as well (if that is safe to do).
[jira] [Commented] (LUCENE-3373) waitForMerges deadlocks if background merge fails
[ https://issues.apache.org/jira/browse/LUCENE-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089674#comment-13089674 ] Tim Smith commented on LUCENE-3373: --- waitForMerges should continue to wait until all merges are complete (regardless of whether they all end up failing). I would suggest updating the MergeThread to catch all exceptions and allow processing of the next merge. Right now, any merge failure results in a ThreadDeath, which seems rather nasty; it should probably just catch the exception and log an index trace message.

waitForMerges deadlocks if background merge fails - Key: LUCENE-3373 URL: https://issues.apache.org/jira/browse/LUCENE-3373 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0.3 Reporter: Tim Smith

waitForMerges can deadlock if a merge fails under ConcurrentMergeScheduler. This is because the merge thread dies, but pending merges are still available. Normally, the merge thread will pick up the next merge once it finishes the previous one, but in the event of a merge exception the pending work is not resumed, and waitForMerges won't complete until all pending work is complete. I worked around this by overriding doMerge() like so:

{code}
protected final void doMerge(MergePolicy.OneMerge merge) throws IOException {
  try {
    super.doMerge(merge);
  } catch (Throwable exc) {
    // Just logging the exception and not rethrowing
    // insert logging code here
  }
}
{code}

Here are the rough steps I used to reproduce this issue. Override doMerge like so:

{code}
protected final void doMerge(MergePolicy.OneMerge merge) throws IOException {
  try { Thread.sleep(500L); } catch (InterruptedException e) { }
  super.doMerge(merge);
  throw new IOException("fail");
}
{code}

then do the following:
loop 50 times: addDocument // any doc
commit
waitForMerges // This will deadlock sometimes

SOLR-2017 may be related to this (the stack trace for that deadlock looked related).
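The doMerge() workaround above amounts to: swallow per-merge failures so the worker keeps draining the queue, and a waiter can never hang on leftover work. Here is a self-contained sketch of that pattern (SafeMergeWorker is a hypothetical class, not part of Lucene):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the workaround: catch Throwable per task so one failing
// "merge" cannot kill the worker while work is still queued.
class SafeMergeWorker {
    final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
    final AtomicInteger attempted = new AtomicInteger();

    void drain() {
        Runnable merge;
        while ((merge = pending.poll()) != null) {
            attempted.incrementAndGet();
            try {
                merge.run();
            } catch (Throwable t) {
                // log and continue with the next merge instead of
                // letting the worker thread die with work still queued
            }
        }
    }
}
```

Without the catch, the first throwing task would terminate drain() with entries still in the queue, which is the shape of the waitForMerges hang described in this issue.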
[jira] [Created] (LUCENE-3373) waitForMerges deadlocks if background merge fails
waitForMerges deadlocks if background merge fails - Key: LUCENE-3373 URL: https://issues.apache.org/jira/browse/LUCENE-3373 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0.3 Reporter: Tim Smith
[jira] Commented: (LUCENE-2658) TestIndexWriterExceptions random failure: AIOOBE in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913097#action_12913097 ] Tim Smith commented on LUCENE-2658: --- Is this related to (or the same as) LUCENE-2501?

TestIndexWriterExceptions random failure: AIOOBE in ByteBlockPool.allocSlice Key: LUCENE-2658 URL: https://issues.apache.org/jira/browse/LUCENE-2658 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.1, 4.0 Reporter: Robert Muir Assignee: Michael McCandless Attachments: LUCENE-2658.patch, LUCENE-2658_environment.patch

TestIndexWriterExceptions threw this today, and it's reproducible.
[jira] Commented: (LUCENE-2658) TestIndexWriterExceptions random failure: AIOOBE in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913140#action_12913140 ] Tim Smith commented on LUCENE-2658: --- Sadly, I haven't been able to gather the infostream for LUCENE-2501. There's a comment on LUCENE-2501 that seems to indicate the exception that started it all, though (CorruptIndexException: docs out of order (607 <= 607)).
[jira] Commented: (LUCENE-2276) Add IndexReader.document(int, Document, FieldSelector)
[ https://issues.apache.org/jira/browse/LUCENE-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888323#action_12888323 ] Tim Smith commented on LUCENE-2276: --- Instead of doing the following everywhere:

{code}
final Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException {
  return doc(n, null, fieldSelector);
}
{code}

you could do:

{code}
final Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException {
  return doc(n, new Document(), fieldSelector);
}
{code}

Then the interface for doc(int, Document, FieldSelector) can state that the document must not be null, and the "if null, new Document()" check later on can be skipped.

Add IndexReader.document(int, Document, FieldSelector) -- Key: LUCENE-2276 URL: https://issues.apache.org/jira/browse/LUCENE-2276 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Tim Smith Attachments: LUCENE-2276.patch

The Document object passed in would be populated with the fields identified by the FieldSelector for the specified internal document id. This method would allow reuse of Document objects when retrieving stored fields from the index.
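The overload contract being proposed can be sketched in plain Java. StoredFieldsReader and its doc() methods below are hypothetical stand-ins, not Lucene's IndexReader API; they just show how the convenience overload supplies the container itself, so the reuse variant can require a non-null argument and skip the null check:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two-overload pattern from the comment above.
class StoredFieldsReader {
    // Convenience overload: always passes a fresh container.
    List<String> doc(int n) {
        return doc(n, new ArrayList<String>());
    }

    // Reuse variant: the contract says 'reuse' must be non-null,
    // so there is no "if null, new Document()" branch here.
    List<String> doc(int n, List<String> reuse) {
        reuse.clear();
        reuse.add("field-" + n); // stand-in for loading real stored fields
        return reuse;
    }
}
```

A caller that fetches many documents can allocate one container up front and pass it to the reuse variant on every call, avoiding a fresh allocation per document.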
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881675#action_12881675 ] Tim Smith commented on LUCENE-2501: --- I've been informed that this exception is still happening. However, whenever index tracing is turned on, it never seems to occur (the extra logging seems to be preventing some lower-level synchronization issue from surfacing).

ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice -- Key: LUCENE-2501 URL: https://issues.apache.org/jira/browse/LUCENE-2501 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1 Reporter: Tim Smith

I'm seeing the following exception during indexing:

{code}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 14
  at org.apache.lucene.index.ByteBlockPool.allocSlice(ByteBlockPool.java:118)
  at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:490)
  at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:511)
  at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:104)
  at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:120)
  at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:468)
  at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
  at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
  at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:774)
  at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:757)
  at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2085)
  ... 37 more
{code}

This seems to be caused by the following code:

{code}
final int level = slice[upto] & 15;
final int newLevel = nextLevelArray[level];
final int newSize = levelSizeArray[newLevel];
{code}

This can result in level being a value between 0 and 14, while the array nextLevelArray is only of size 10. I suspect the solution would be either to cap the level at 10 or to add more entries to nextLevelArray so it has 15 entries. However, I don't know if something more is going wrong here and this is just where the exception surfaces from a deeper issue.
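The out-of-bounds condition described above is easy to reproduce in isolation. In the sketch below, the table contents mirror the nextLevelArray values the comment refers to, while the byte value fed in is a made-up example of a corrupted slice marker:

```java
// Demonstrates the mismatch: (slice[upto] & 15) can yield 0..14, but a
// lookup table with only 10 entries throws AIOOBE for any level >= 10.
class AllocSliceDemo {
    // Same contents as Lucene's nextLevelArray (10 entries).
    static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

    static int nextLevel(byte sliceEnd) {
        int level = sliceEnd & 15;      // low nibble: any value 0..14
        return NEXT_LEVEL_ARRAY[level]; // AIOOBE when level >= 10
    }
}
```

A valid slice end byte keeps the low nibble in 0..9, so the lookup succeeds; the exception in the stack trace (index 14) implies the byte's low nibble held a value that should never appear in a healthy pool, consistent with the suspicion of a deeper corruption.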
[jira] Created: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice -- Key: LUCENE-2501 URL: https://issues.apache.org/jira/browse/LUCENE-2501 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1 Reporter: Tim Smith
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879382#action_12879382 ] Tim Smith commented on LUCENE-2501: --- That's what I was afraid of. I got this report second hand, so I don't have access to the data that was being ingested, and I currently don't know enough about this section of the indexing code to guess in order to create a unit test. I'll try to create a test, but I expect it will be difficult (especially if no one else has ever seen this).
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879403#action_12879403 ] Tim Smith commented on LUCENE-2501: --- Here's all the info I have available right now (will try to get more):
* 16-core, 18-gig RAM Windows 7 machine
* 1 JVM
* 16 index writers (each using default settings: 64M RAM, etc.)
* 300+ docs/sec ingestion (small documents)
* commit every 10 minutes
* optimize every hour

The report I got indicated that every now and then one of these ArrayIndexOutOfBounds exceptions would occur. This would result in the document being indexed failing, but otherwise things would continue normally.
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879422#action_12879422 ] Tim Smith commented on LUCENE-2501: --- Some more info:
* ingestion is being performed in multiple threads
* the ArrayIndexOutOfBounds exceptions are occurring in bursts

I suspect that these bursts of exceptions stop after the next commit (at which point the buffers are all reset). NOTE: I have not yet confirmed this, but I suspect it.
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879483#action_12879483 ] Tim Smith commented on LUCENE-2501: --- Looks like this may be the original source of the errors:

{code}
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (607 <= 607 )
	at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
	at org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
	at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
	at org.apache.lucene.index.TermsHash.flush(TermsHash.java:144)
	at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
	at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
	at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:583)
	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3602)
	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3511)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3502)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2103)
{code}
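The failure above violates the invariant that postings for a term are flushed with strictly increasing doc IDs: 607 compared against the previous 607 means the same doc ID was appended twice, consistent with concurrent corruption of the in-memory postings rather than bad input. A simplified stand-in for the ordering check (not the actual Lucene class; the exception type is swapped so the sketch stays self-contained):

```java
// Simplified stand-in for the docID-ordering check in FormatPostingsDocsWriter.
class PostingsOrderCheck {
    private int lastDocID = -1;

    void addDoc(int docID) {
        if (docID <= lastDocID) {
            // Lucene throws CorruptIndexException here with this wording
            throw new IllegalStateException(
                "docs out of order (" + docID + " <= " + lastDocID + " )");
        }
        lastDocID = docID;
    }
}
```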
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879489#action_12879489 ] Tim Smith commented on LUCENE-2501: --- Will do. It may take some time before it occurs again. Also, if this boils down to a synchronization error of some sort, the extra file I/O done to write the trace info to disk may add some implicit synchronization/slowdown that may result in not being able to reproduce the issue. (I've seen this occur on non-Lucene-related synchronization issues: add the extra debug logging and it never fails anymore.)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857375#action_12857375 ] Tim Smith commented on LUCENE-2324: --- bq. But... could we allow an add/updateDocument call to express this affinity, explicitly?

I would love to be able to explicitly define a segment affinity for documents I'm feeding. This would then allow me to say:
* all docs from table a have affinity 1
* all docs from table b have affinity 2

This would ideally result in indexing documents from each table into a different segment. (Obviously, I would then need to be able to have segment merging be affinity aware, so optimize/merging would only merge segments that share an affinity.)

Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 Attachments: lucene-2324.patch, LUCENE-2324.patch

See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU. -- This message is automatically generated by JIRA. 
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857385#action_12857385 ] Tim Smith commented on LUCENE-2324: --- bq. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices?

That's what I'm currently thinking I'll have to do. However, it would be ideal if I could either subclass IndexWriter or use IndexWriter directly with this affinity concept (potentially writing my own segment merger that is affinity aware). That makes it so I can easily use near-real-time indexing, as only one IndexWriter will be in the mix, as well as make managing deletes and a whole host of other issues with multiple indexes disappear. It also makes it so I can configure memory settings across all affinity groups instead of having to dynamically create separate indexes, each with their own memory bounds.
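The separate-indices workaround discussed here amounts to routing each document to a per-affinity writer, so segments (and merges) never cross affinity groups. A hypothetical sketch of that routing, generic over the writer type since the actual IndexWriter wiring is deployment-specific (all names here are illustrative, not Lucene API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical router: lazily creates one writer per affinity group, so each
// group's documents land in, and merge within, their own index.
class AffinityRouter<W> {
    private final Map<Integer, W> writers = new HashMap<>();
    private final IntFunction<W> writerFactory;

    AffinityRouter(IntFunction<W> writerFactory) {
        this.writerFactory = writerFactory;
    }

    // Returns the writer that all documents with this affinity should go to.
    W writerFor(int affinity) {
        return writers.computeIfAbsent(affinity, writerFactory::apply);
    }

    int groupCount() {
        return writers.size();
    }
}
```

This buys the segregation but not what the comment asks for: a single IndexWriter with shared memory accounting, NRT reopen, and unified delete handling; each routed writer still carries its own RAM budget.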
[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders
[ https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851388#action_12851388 ] Tim Smith commented on LUCENE-2071: --- +1

I have a special subclassed IndexSearcher that certain special queries require, so IndexWriter's delete-by-query will fail as an IndexSearcher is passed in. With this added method, I would be able to construct my own Searcher over the readers and then apply deletes properly. This would also allow counting the deletes as they occur (which is commonly desired when deleting by query).

It would be nice if this method also worked with non-pooled readers, so my desired method signature would be:

{code}
void updateReaders(Readers callback, boolean pooled)
{code}

If the readers were already pooled, this would have no effect; otherwise it would just open the segment readers, just like the non-pooled delete readers are opened.

Allow updating of IndexWriter SegmentReaders Key: LUCENE-2071 URL: https://issues.apache.org/jira/browse/LUCENE-2071 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.1 Attachments: LUCENE-2071.patch

This discussion kind of started in LUCENE-2047. Basically, we'll allow users to perform delete document, and norms updates on SegmentReaders that are handled by IndexWriter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders
[ https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851528#action_12851528 ] Tim Smith commented on LUCENE-2071: --- Found a couple of small issues with the patch attached to this ticket:

1. applyDeletes issue: saw this was in another ticket. I think the flush should be flush(true, true, false), and applyDeletes() should be called in the synchronized block.
2. IndexWriter.changeCount not updated: the call() method does not return a boolean indicating if there were any changes that would need to be committed. As a result, if no other changes are made to the IndexWriter, the commit will be skipped, even though deletes/norm updates were sent in. IndexReader.reopen() will then return the old reader without the deletes/norms.
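The second issue comes down to a callback that cannot report whether it changed anything, so IndexWriter's change counter stays put and the next commit is a no-op. A hedged sketch of a shape that avoids the skipped commit, with illustrative names (this is not the patch's actual API):

```java
// Hypothetical callback that reports whether it made changes, letting the
// writer bump its change counter only when a commit is actually needed.
interface ReaderUpdate<R> {
    boolean call(R readers);  // true => deletes/norm updates were applied
}

class WriterSketch<R> {
    long changeCount = 0;

    void updateReaders(ReaderUpdate<R> callback, R readers) {
        if (callback.call(readers)) {
            changeCount++;  // without this, a later commit() would be skipped
        }
    }
}
```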
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850127#action_12850127 ] Tim Smith commented on LUCENE-2345: --- bq. I think we should only commit this only on 3.1 (new feature)?

3.1 only, of course (just posted a 3.0 patch now as that's what I'm using and I need the functionality now).

bq. Tim, do you think the plugin model (extension by composition) would be workable for your use case? Ie, instead of a factory enabling subclasses of SegmentReader?

As long as the plugin model allows the same capabilities, that could work just fine and could be the final solution for this ticket. I mainly need the ability to add data structures to a SegmentReader that will be shared by all SegmentReader instances for a segment, and then add some extra meta information on a per-instance basis. Is there a ticket or wiki page that details the plugin architecture/design so I could take a look?

However, would the plugins allow overriding specific IndexReader methods? I still see the need to override specific methods on a SegmentReader (in order to track statistics or provide changed/different/faster/more feature-rich implementations). I don't have a direct need for this right now, but I could envision needing it in the future.

Here are a few requirements I would pose for the plugin model (maybe they are already thought of):
* Plugins have hooks to reopen themselves (some plugins can be shared across all instances of a SegmentReader)
** These reopen hooks would be called during SegmentReader.reopen()
* Plugins are initialized during SegmentReader.get()/SegmentReader.reopen()
** Plugins should not have to be added after the fact, as this would not allow proper warming/initializing of plugins inside NRT indexing
** I assume this would need to be added as some list of PluginFactories passed to IndexWriter/IndexReader.open()?
* Plugins should have a close method that is called in SegmentReader.close()
** This will allow proper release of any resources
* Plugins are passed an instance of the SegmentReader they are for
** Plugins should be able to access all methods on a SegmentReader
** This would effectively allow overriding a SegmentReader by having a plugin provide the functionality instead (however, only people explicitly calling the plugin would get this benefit)

Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch

I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* to attach custom objects to an instance of a segment reader (caches, statistics, so on and so forth)
* to override methods on SegmentReader as needed

Currently this isn't really possible. I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:

{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return newSegmentReader(readOnly);
  }
}
{code}

It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc). I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in future.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
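The lifecycle requirements listed above (initialize during open/reopen, close with the reader, access to the owning reader) can be distilled into a small contract. This is a sketch of what such a SegmentPlugin interface might look like, not an API from any attached patch; the RecordingPlugin simply demonstrates the expected call order:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract distilled from the requirements above.
interface SegmentPlugin {
    void init(Object segmentReader);   // would be called from SegmentReader.get()/reopen()
    void close();                      // would be called from SegmentReader.close()
}

// Trivial implementation that records lifecycle events for inspection.
class RecordingPlugin implements SegmentPlugin {
    final List<String> events = new ArrayList<>();

    public void init(Object segmentReader) {
        events.add("init");
    }

    public void close() {
        events.add("close");
    }
}
```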
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2345: -- Attachment: LUCENE-2345_3.0.plugins.patch

Here's a patch (again, against 3.0) showing the minimal API I would like to see from the plugin model.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850323#action_12850323 ] Tim Smith commented on LUCENE-2345: --- Found one issue with the plugins patch: with NRT indexing, if the SegmentReader is opened with no TermInfosReader (for merging), then the plugins will be initialized with a SegmentReader that has no ability to walk the TermsEnum. I guess SegmentPlugin initialization should wait until after the terms index is loaded, or another method for catching this event should be added to the SegmentPlugin interface.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850361#action_12850361 ] Tim Smith commented on LUCENE-2345: ---
bq. My patch removes loadTermsIndex method from SegmentReader and requires you to reopen it.
that's definitely much cleaner and would solve the issue in my current patch (sadly i'm on 3.0 and want to keep my patch there at a minimum until i can port to all the goodness on 3.1).
bq. Also, they extend not only SegmentReader, but the whole hierarchy - SR, MR, DR, whatever.
i just wussed out and did only the SegmentReader case as that's all i need right now
bq. as all the hooks are on the factory classes
could you post your factory class interface? If i base my 3.0 patch off that i can reduce my 3.1 port overhead. are there any tickets tracking your reopen refactors or your plugin model? If not, feel free to retool this ticket for your plugin model for IndexReaders as that will solve my use cases (and then some)

Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch

I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* attach custom objects to an instance of a segment reader (caches, statistics, and so forth)
* override methods on SegmentReader as needed

currently this isn't really possible. I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:
{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return get(readOnly);
  }
}
{code}
It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc). I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in future. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
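The factory proposed above can be sketched stand-alone. Everything below is a hypothetical stub for illustration (these are not the real Lucene SegmentReader classes); CountingReaderFactory shows the kind of subclass hook the ticket asks for, attaching per-reader bookkeeping without touching core code.

```java
// Stand-in stubs for illustration only; not the real Lucene classes.
class SegmentReader {
    protected void init() { }          // hook called after construction
    boolean isReadOnly() { return false; }
}

class ReadOnlySegmentReader extends SegmentReader {
    @Override boolean isReadOnly() { return true; }
}

// The proposed factory: override get()/reopen() to return your own subclass.
class SegmentReaderFactory {
    public SegmentReader get(boolean readOnly) {
        SegmentReader r = readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
        r.init();                      // let subclasses do extra initialization
        return r;
    }
    public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
        return get(readOnly);          // default: just build a fresh reader
    }
}

// Example subclass that captures open events (caches, statistics, ...).
class CountingReaderFactory extends SegmentReaderFactory {
    int opened = 0;
    @Override public SegmentReader get(boolean readOnly) {
        opened++;
        return super.get(readOnly);
    }
}
```

Because reopen() dispatches back through get(), a subclass only has to override one method to intercept every reader the factory hands out.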
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2345: -- Attachment: LUCENE-2345_3.0.patch
Here's a patch against 3.0 that provides the SegmentReaderFactory ability (not tested yet, but i'll be doing that shortly as i integrate this functionality)
* It adds a SegmentReaderFactory; IndexWriter now has a getter and setter for it
* SegmentReader has a new protected method init(), which is called after the segment reader has been initialized (to allow subclasses to hook this action and do additional initialization, etc)
* added 2 new IndexReader.open() methods that allow specifying the SegmentReaderFactory
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849731#action_12849731 ] Tim Smith commented on LUCENE-2345: --- that was my plan
[jira] Created: (LUCENE-2345) Make it possible to subclass SegmentReader
Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849358#action_12849358 ] Tim Smith commented on LUCENE-1821: --- This would actually be solved by LUCENE-2345 for me, as i would then be able to tag SegmentReaders with any additional accounting information i would need

Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-1821.patch

Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id for the documents it matches (only the relative doc offset into the segment). If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid.

Suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub weights must pass this offset down to created sub weights.

Details on the workaround: In order to work around this, you must do the following:
* Subclass IndexSearcher
* Add an int getIndexReaderBase(IndexReader) method to your subclass
* during Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass)
* during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
* the Scorer can now rebase any collected docids using this offset

Example implementation of getIndexReaderBase():
{code}
// NOTE: a more efficient implementation can cache the result of
// gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getReader()) {
    return 0;
  } else {
    List readers = new ArrayList();
    gatherSubReaders(readers);
    Iterator iter = readers.iterator();
    int maxDoc = 0;
    while (iter.hasNext()) {
      IndexReader r = (IndexReader) iter.next();
      if (r == reader) {
        return maxDoc;
      }
      maxDoc += r.maxDoc();
    }
  }
  return -1; // reader not in searcher
}
{code}
Notes:
* This workaround makes it so you cannot serialize your custom Weight implementation
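The offset bookkeeping in the workaround above reduces to a prefix sum over per-segment maxDoc counts. A minimal self-contained sketch of just that arithmetic (DocBase and baseFor are illustrative names, not Lucene API):

```java
import java.util.List;

// Doc-base bookkeeping: the absolute doc base of a segment is the sum of
// maxDoc over all segments that come before it in reader order.
class DocBase {
    static int baseFor(List<Integer> maxDocs, int target) {
        int base = 0;
        for (int i = 0; i < maxDocs.size(); i++) {
            if (i == target) return base;
            base += maxDocs.get(i);
        }
        return -1; // segment not found
    }
}
```

A scorer would then rebase a segment-relative hit as absoluteDoc = base + relativeDoc, which is exactly what the getIndexReaderBase() workaround computes by walking the sub-readers.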
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849455#action_12849455 ] Tim Smith commented on LUCENE-2345: --- that's the reassurance i needed :) will start working on a patch tomorrow. will take a few days, as i'll start with a 3.0 patch (which i use), then will create a 3.1 patch once i've got that all fleshed out
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849497#action_12849497 ] Tim Smith commented on LUCENE-2345: --- i'll do my initial work on 3.0 so i can absorb the changes now, and will post that patch. at which point, i can wait for you to finish whatever you need, or we can just incorporate the same ability into your patch for the other ticket. i would just like to see the ability to subclass SegmentReaders in 3.1 so i don't have to port a patch when i absorb 3.1 (just use the finalized apis)
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844930#action_12844930 ] Tim Smith commented on LUCENE-2310: --- Personally, i like keeping Fieldable (or having AbstractField just with abstract methods and no actual implementation). for feeding documents, i use custom Fieldable implementations to reduce the number of setters called, as Fields of different types have different constant settings

Reduce Fieldable, AbstractField and Field complexity -- Key: LUCENE-2310 URL: https://issues.apache.org/jira/browse/LUCENE-2310 Project: Lucene - Java Issue Type: Sub-task Components: Index Reporter: Chris Male Attachments: LUCENE-2310-Deprecate-AbstractField.patch

In order to move field-type-like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality than storing fields, most of which is being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possibly Fieldable), moving much of the functionality into Field and FieldType.
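The "custom Fieldable implementations to reduce setters" idea above can be sketched without any Lucene types: bake the constant settings into a subclass constructor so feeding code never touches a setter. SimpleField and KeywordField here are hypothetical stand-ins, not the real Fieldable/AbstractField/Field classes.

```java
// Illustrative stand-in for a field with per-type constant settings.
class SimpleField {
    final String name, value;
    final boolean stored, indexed, tokenized;
    SimpleField(String name, String value,
                boolean stored, boolean indexed, boolean tokenized) {
        this.name = name; this.value = value;
        this.stored = stored; this.indexed = indexed; this.tokenized = tokenized;
    }
}

// A custom subclass bakes its constant settings in, so document-feeding
// code constructs it with no setter calls at all.
class KeywordField extends SimpleField {
    KeywordField(String name, String value) {
        super(name, value, /*stored=*/true, /*indexed=*/true, /*tokenized=*/false);
    }
}
```

One subclass per field type keeps the settings in a single place and makes the feeding loop a plain sequence of constructor calls.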
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839682#action_12839682 ] Tim Smith commented on LUCENE-2283: --- i haven't been able to fully replicate this issue in a unit test scenario, however this will definitely resolve the 40M of ram that was allocated and never released for the RAMFiles on the StoredFieldsWriter (keeping that bound to the configured memory size)

Possible Memory Leak in StoredFieldsWriter -- Key: LUCENE-2283 URL: https://issues.apache.org/jira/browse/LUCENE-2283 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1 Reporter: Tim Smith Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2283.patch, LUCENE-2283.patch, LUCENE-2283.patch

StoredFieldsWriter creates a pool of PerDoc instances. this pool will grow but never be reclaimed by any mechanism. furthermore, each PerDoc instance contains a RAMFile. this RAMFile will also never be truncated (and will only ever grow) (as far as i can tell). When feeding documents with a large number of stored fields (or one large dominating stored field) this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare. Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of RAMFiles that are cached) etc
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838976#action_12838976 ] Tim Smith commented on LUCENE-2283: --- I'll work up another patch. might take me a few minutes to get my head wrapped around the TermVectorsTermsWriter stuff
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2283: -- Attachment: LUCENE-2283.patch Here's a new patch with your suggestions
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2283: -- Attachment: LUCENE-2283.patch Here's a patch for using a pool for stored fields buffers
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837793#action_12837793 ] Tim Smith commented on LUCENE-2283: --- I came across this issue looking for a reported memory leak during indexing. a yourkit snapshot showed that the PerDocs for an IndexWriter were using ~40M of memory (at which point i came across this potentially unbounded memory use in StoredFieldsWriter).

this snapshot seems more or less at a stable point (memory grows but then returns to a normal state), however i have reports that eventually the memory is completely exhausted, resulting in out of memory errors. I so far have not found any other major culprit in the lucene indexing code. This index receives a routine mix of very large and very small documents (which would explain this situation). The VM and system have a more than ample amount of memory given the buffer size and what should be normal indexing RAM requirements.

Also, a major difference between this leak not occurring and it showing up is that previously the IndexWriter was closed when performing commits; now the IndexWriter remains open (just calling IndexWriter.commit()). So, if any memory is leaking during indexing, it is no longer being reclaimed during commit. As a side note, closing the index writer at commit time would sometimes fail, resulting in some following updates failing because the index writer was locked and couldn't be reopened until the old index writer was garbage collected, so i don't want to go back to this for commits.

Its possible there is a leak somewhere else (i currently do not have a snapshot from right before out of memory issues occur, so currently the only thing that stands out is the PerDoc memory use).

As far as a fix goes, wouldn't it be better to have the RAMFiles used for stored fields pull and return byte buffers from the byte block pool on the DocumentsWriter? This would allow the memory to be reclaimed based on the index writer's buffer size (otherwise there is no configurable way to tune this memory use)
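The single-pool fix proposed above (stored-field buffers drawn from a shared, budget-bounded block pool instead of per-PerDoc RAMFiles) can be sketched in isolation. Everything here is illustrative (the class name, block size, and recycle policy are assumptions, not DocumentsWriter internals); it only shows the shape of the idea: allocation is unbounded, but retention is capped by the configured budget.

```java
import java.util.ArrayDeque;

// A size-bounded byte-block pool: blocks are handed out freely, but only
// a budget's worth are kept on return; the rest are dropped for the GC
// to reclaim (the "balanceRAM" behavior described in the thread).
class ByteBlockPool {
    static final int BLOCK_SIZE = 32 * 1024;       // illustrative block size
    private final int maxPooledBlocks;              // derived from the RAM budget
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    int allocated = 0;                              // blocks currently handed out

    ByteBlockPool(long ramBudgetBytes) {
        this.maxPooledBlocks = (int) (ramBudgetBytes / BLOCK_SIZE);
    }

    synchronized byte[] getBlock() {
        allocated++;
        byte[] b = free.poll();                     // reuse a pooled block if any
        return b != null ? b : new byte[BLOCK_SIZE];
    }

    synchronized void recycle(byte[] block) {
        allocated--;
        if (free.size() < maxPooledBlocks) free.push(block);  // keep within budget
    }

    synchronized int pooledBlocks() { return free.size(); }
}
```

With all stored-field buffers drawn from one such pool, retained memory is bounded by the writer's configured buffer size rather than growing with the largest document ever seen.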
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837821#action_12837821 ] Tim Smith commented on LUCENE-2283: --- ramBufferSizeMB is 64MB. Here's the yourkit breakdown per class:
* DocumentsWriter - 256 MB
** TermsHash - 38.7 MB
** StoredFieldsWriter - 37.5 MB
** DocumentsWriterThreadState - 36.2 MB
** DocumentsWriterThreadState - 34.6 MB
** DocumentsWriterThreadState - 33.8 MB
** DocumentsWriterThreadState - 27.5 MB
** DocumentsWriterThreadState - 13.4 MB

I'm starting to dig into the ThreadStates now to see if anything stands out here.
bq. Hmm, that makes me nervous, because I think in this case the use should be bounded.
I should be getting a new profile dump at crash time soon, so hopefully that will make things clearer
bq. That doesn't sound good! Can you post some details on this (eg an exception)?
If i recall correctly, the exception was caused by an out of disk space situation (which would recover). obviously not much that can be done about this other than adding more disk space; the situation would recover, but docs would be lost in the interim
bq. But, anyway, keeping the same IW open and just calling commit is (should be) fine.
Yeah, this should be the way to go, especially as it results in the pooled buffers not needing to be reallocated/reclaimed/etc. however, right now this is the only change i can currently think of that could result in memory issues.
bq. Yes, that's a great solution - a single pool. But that's a somewhat bigger change.
Seems like this would be the best approach, as it makes the memory bounded by the configuration of the engine, giving better reuse of byte blocks and a better ability to reclaim memory (in DocumentsWriter.balanceRAM())
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837875#action_12837875 ] Tim Smith commented on LUCENE-2283: ---
bq. I agree. I'll mull over how to do it... unless you're planning on consing up a patch
I'd love to, but don't have the free cycles at the moment :(
bq. How many threads do you pass through IW?
honestly i don't 100% know about the origin of the threads i'm given. In general, they should be from a static pool, but may be dynamically allocated if the static pool runs out. One thought i had recently was to control this more tightly by having a limited number of static threads that called IndexWriter methods, in case that was the issue (but that would be a pretty big change)
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837881#action_12837881 ] Tim Smith commented on LUCENE-2283: --- latest profile dump has pointed out a non-lucene issue as causing some memory growth, so feel free to drop down priority. however it seems like using the bytepool for the stored fields would be good overall
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837919#action_12837919 ] Tim Smith commented on LUCENE-2283: --- another note is that this was on a 64 bit vm. i've noticed that all the memsize calculations assume 4 byte pointers, so perhaps that can lead to more memory being used than would otherwise be expected (although 256 MB is still well over the 2X mem use that would potentially be expected in that case)
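The 4-byte-pointer point above can be made concrete with a back-of-envelope calculation: an estimator that charges 4 bytes per object reference undercounts by up to 2x on a 64-bit VM using uncompressed 8-byte references. The class and numbers below are purely illustrative and ignore object headers and alignment.

```java
// Back-of-envelope reference-size arithmetic: bytes consumed by an array
// of object references, at a given per-reference size (4 on 32-bit or
// with compressed oops, 8 on 64-bit without them).
class RefSizeEstimate {
    static long refArrayBytes(long numRefs, int refSizeBytes) {
        return numRefs * refSizeBytes;   // ignores headers/alignment for simplicity
    }
}
```

So an accounting model built on 4-byte pointers explains at most a 2x discrepancy on references, which is why the 256 MB observation above still exceeds what pointer width alone could account for.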
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838017#action_12838017 ] Tim Smith commented on LUCENE-2283:
---
i'm working up a patch for the shared byte-block pool for stored field buffers (found a few cycles).
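A shared block pool along these lines might look like the following sketch. This is hypothetical (not the patch being worked up); the block size, pool cap, and class name are made-up illustration values:

```java
import java.util.ArrayDeque;

// Hypothetical shared byte-block pool: writers borrow fixed-size blocks and
// return them, so buffer memory is bounded and recycled instead of each
// per-doc buffer growing without limit.
class ByteBlockPool {
    static final int BLOCK_SIZE = 32 * 1024;
    static final int MAX_POOLED_BLOCKS = 64;
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    // Hand out a recycled block if one is available, else allocate fresh.
    byte[] borrow() {
        byte[] b = free.poll();
        return b != null ? b : new byte[BLOCK_SIZE];
    }

    // Return a block to the pool; cap the pool so it cannot grow unbounded.
    void release(byte[] block) {
        if (free.size() < MAX_POOLED_BLOCKS) {
            free.push(block);
        } // else drop the block and let GC reclaim it
    }

    int pooled() { return free.size(); }
}
```

Because every block is the same fixed size, one huge document costs more blocks transiently but never inflates the retained footprint of any single pooled buffer.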
[jira] Created: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
Possible Memory Leak in StoredFieldsWriter
--
Key: LUCENE-2283
URL: https://issues.apache.org/jira/browse/LUCENE-2283
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
[jira] Created: (LUCENE-2276) Add IndexReader.document(int, Document, FieldSelector)
Add IndexReader.document(int, Document, FieldSelector)
--
Key: LUCENE-2276
URL: https://issues.apache.org/jira/browse/LUCENE-2276
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Tim Smith

The Document object passed in would be populated with the fields identified by the FieldSelector for the specified internal document id. This method would allow reuse of Document objects when retrieving stored fields from the index.
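The reuse pattern this wish enables can be sketched generically. The `Doc` class and `loadInto` method below are stand-ins for the proposed `IndexReader.document(int, Document, FieldSelector)`, which does not exist in Lucene; only the allocation pattern is the point:

```java
import java.util.ArrayList;
import java.util.List;

// Generic sketch of the reuse pattern: one container object is cleared and
// repopulated per document instead of being reallocated per call.
class ReuseSketch {
    static final class Doc {
        final List<String> fields = new ArrayList<>();
        void clear() { fields.clear(); }
    }

    // Stand-in for the proposed document(int, Document, FieldSelector):
    // fills 'target' in place for the given doc id, no new Document created.
    static void loadInto(int id, Doc target) {
        target.clear();
        target.fields.add("field-for-doc-" + id);
    }

    static void scan(int maxDoc) {
        Doc reusable = new Doc();          // allocated once, outside the loop
        for (int id = 0; id < maxDoc; id++) {
            loadInto(id, reusable);        // no per-doc Document allocation
        }
    }
}
```

Without such a method, retrieving stored fields for N documents allocates N Document objects (plus their field lists); with it, a hot loop produces essentially no garbage.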
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790803#action_12790803 ] Tim Smith commented on LUCENE-1923:
---
added getName() in case anyone is currently relying on the current (default) output from toString() on index readers. feel free to rename the getName() methods to toString().

Add toString() or getName() method to IndexReader
--
Key: LUCENE-1923
URL: https://issues.apache.org/jira/browse/LUCENE-1923
Project: Lucene - Java
Issue Type: Wish
Components: Index
Reporter: Tim Smith
Assignee: Michael McCandless
Attachments: LUCENE-1923.patch

It would be very useful for debugging if IndexReader either had a getName() method, or a toString() implementation that would get a string identification for the reader. for SegmentReader, this would return the same as getSegmentName(). for Directory readers, this would return the generation id? for MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...). right now, i have to check instanceof for SegmentReader, then call getSegmentName(), and for all other IndexReader types, i would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point i would have to recursively walk the sub readers and try again). I could work up a patch if others like this idea.
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787472#action_12787472 ] Tim Smith commented on LUCENE-1923:
---
i won't have the time till after the new year. if someone else wants to work up a patch, go for it (this seems simple enough and adds some nice info capabilities for logging/etc); otherwise, i'll get to it when i can.
[jira] Updated: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1923:
--
Attachment: LUCENE-1923.patch

Here's a simple patch to get the ball rolling. This adds a getName() method to IndexReader. The default implementation will be: SimpleClassName(subreader.getName(), subreader.getName(), ...)
SegmentReader will return the same value as getSegmentName().
DirectoryReader will return: DirectoryReader(segment_N, segment.getName(), segment.getName(), ...)
ParallelReader will return: ParallelReader(parallelReader1.getName(), parallelReader2.getName(), ...)
This patch does not currently have toString() return getName(). Do with this patch as you will.
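The naming scheme described in the patch can be sketched with simplified stand-ins (these classes are illustrative, not the actual patch code or the real IndexReader hierarchy):

```java
// Hypothetical sketch of the naming scheme: leaf readers report their segment
// name, composite readers compose their label with their sub-readers' names.
class ReaderNames {
    interface Named { String getName(); }

    // Stand-in for SegmentReader: getName() is just the segment name.
    static final class Segment implements Named {
        private final String name;
        Segment(String name) { this.name = name; }
        public String getName() { return name; }
    }

    // Stand-in for DirectoryReader/MultiReader/ParallelReader:
    // Label(sub1, sub2, ...) built recursively from sub-readers.
    static final class Composite implements Named {
        private final String label;
        private final Named[] subs;
        Composite(String label, Named... subs) { this.label = label; this.subs = subs; }
        public String getName() {
            StringBuilder sb = new StringBuilder(label).append('(');
            for (int i = 0; i < subs.length; i++) {
                if (i > 0) sb.append(", ");
                sb.append(subs[i].getName());
            }
            return sb.append(')').toString();
        }
    }
}
```

Nested composites fall out for free: a Composite over Composites recursively produces names like `ParallelReader(DirectoryReader(_0, _1), _2)`, which is exactly the debugging breadcrumb the ticket asks for.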
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786921#action_12786921 ] Tim Smith commented on LUCENE-1859:
---
close if you like. application writers can add guards for this if they like/need to as a custom TokenFilter. mainly created this ticket as this can result in an unbounded buffer should people use the token stream api incorrectly (or against the suggestions of lucene core developers).

TermAttributeImpl's buffer will never shrink if it grows too big
--
Key: LUCENE-1859
URL: https://issues.apache.org/jira/browse/LUCENE-1859
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

This was also an issue with Token previously as well. If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory. Obviously, it can be argued that Tokenizers should never emit large tokens, however it seems that TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set. I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario). perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.
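The shrink logic suggested in the last sentence could look roughly like this. It is a hypothetical sketch, not the real TermAttributeImpl (the growth policy is simplified to exact-size allocation, and the MAX_BUFFER_SIZE value is made up):

```java
// Sketch of the suggested growTermBuffer change: shrink an oversized buffer
// back to MAX_BUFFER_SIZE once a token that fits within it arrives.
class ShrinkingTermBuffer {
    static final int MAX_BUFFER_SIZE = 16 * 1024;
    private char[] termBuffer = new char[32];

    char[] resizeBuffer(int needed) {
        if (needed > termBuffer.length) {
            // grow (simplified: exact size rather than an oversize policy)
            termBuffer = new char[needed];
        } else if (termBuffer.length > MAX_BUFFER_SIZE && needed <= MAX_BUFFER_SIZE) {
            // the suggested addition: reclaim memory left by a rare huge token
            termBuffer = new char[MAX_BUFFER_SIZE];
        }
        return termBuffer;
    }

    int capacity() { return termBuffer.length; }
}
```

One rare multi-megabyte token inflates the buffer only until the next normal-sized token, so the per-thread worst case is bounded by MAX_BUFFER_SIZE rather than the largest token ever seen.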
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781615#action_12781615 ] Tim Smith commented on LUCENE-2086:
---
Got some performance numbers. Description of test (NOTE: this is representative of actions that may occur in a running system, not a contrived test):
* feed 4 million operations (3/4 are deletes, 1/4 are updates (single field))
* commit
* feed 1 million operations (about 1/3 are updates, 2/3 deletes (randomly selected))
* commit

Numbers:
|| Desc || Old || New ||
| feed 4 million | 56914ms | 15698ms |
| commit 4 million | 9072ms | 14291ms |
| total (4 million) | 65986ms | 29989ms |
| update 1 million | 46096ms | 11340ms |
| commit 1 million | 13501ms | 9273ms |
| total (1 million) | 59597ms | 20613ms |

This shows significant improvements with the patched code (1/3 the time for the 1 million run, about 1/2 the time for the initial 4 million feed). This means i'm definitely going to need to incorporate this patch while i'm still on 3.0 (will upgrade to 3.0 as soon as it's out, then apply this fix). Ideally, a 3.0.1 would be forthcoming in the next month or so with this fix so i wouldn't have to maintain this patched overlay of code.

When resolving deletes, IW should resolve in term sort order
--
Key: LUCENE-2086
URL: https://issues.apache.org/jira/browse/LUCENE-2086
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2086.patch

See the java-dev thread IndexWriter.updateDocument performance improvement.
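A quick sanity check of the speedups implied by the totals in the table above (the helper class is just for the arithmetic):

```java
// Compute the old/new speedup ratios from the reported totals.
class Speedup {
    static double ratio(long oldMs, long newMs) {
        return (double) oldMs / newMs;
    }
}
```

`ratio(65986, 29989)` is about 2.2x for the 4 million feed and `ratio(59597, 20613)` about 2.9x for the 1 million run, matching the "about 1/2 the time" and "1/3 the time" claims.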
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780698#action_12780698 ] Tim Smith commented on LUCENE-2086:
---
any chance this can go into 3.0.0 or a 3.0.1?
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780701#action_12780701 ] Tim Smith commented on LUCENE-2086:
---
i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner rather than later (what's the timeframe for 3.1?). otherwise i'll have to override the classes involved and pull in this patch (never liked this approach myself).
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780710#action_12780710 ] Tim Smith commented on LUCENE-2086:
---
bq. maybe try it and report back?
i'll see if i can find some cycles to try this against the most painful use case i have.
bq. I'd rather see us release a 3.1 sooner rather than later, instead.
yes please. I would definitely like to see a more accelerated release cycle (even if less functionality gets into each minor release).
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777008#action_12777008 ] Tim Smith commented on LUCENE-1909:
---
I have the following use case: i have a configuration bean; this bean can be customized via xml at config time. in this bean, i expose the setting for the terms index divisor, so my bean has to have a default value for this. right now, i just use 1 for the default value. it would be nice if i could use the lucene constant instead of a hard-coded 1, as the lucene constant could change in the future (not really likely, but it's one less constant i have to maintain). if the default is not made public, i have 2 options:
# use a hard-coded constant in my code for the default value (doing this right now)
# use an Integer object, and have null be the default
the nasty part about the second option is that i now have to do conditional opening of the reader depending on whether the value is null (unset), when it would be much simpler (and easier for me to maintain) if i could just always pass in the value.

Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
---
Key: LUCENE-1909
URL: https://issues.apache.org/jira/browse/LUCENE-1909
Project: Lucene - Java
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
Fix For: 3.0
Attachments: LUCENE_1909.patch
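The two options can be sketched as one hypothetical bean (the bean and its local constant are illustrative; the constant's value 1 mirrors the current default, standing in for a public IndexReader.DEFAULT_TERMS_INDEX_DIVISOR):

```java
// Hypothetical configuration bean illustrating both options from the comment.
class ReaderConfig {
    // Option 1: a local hard-coded default (what the comment describes doing
    // today). With a public Lucene constant, this line would reference it
    // directly instead of duplicating the value 1.
    static final int DEFAULT_TERMS_INDEX_DIVISOR = 1;
    private int termsIndexDivisor = DEFAULT_TERMS_INDEX_DIVISOR;

    void setTermsIndexDivisor(int divisor) { this.termsIndexDivisor = divisor; }
    int getTermsIndexDivisor() { return termsIndexDivisor; }

    // Option 2: a nullable Integer where null means "unset, let Lucene pick".
    // This forces conditional reader-opening logic at every use site.
    private Integer maybeDivisor;
    boolean isDivisorSet() { return maybeDivisor != null; }
}
```

With option 1 the open path always passes a concrete divisor; with option 2 every open must branch on `isDivisorSet()`, which is the conditional-opening awkwardness the comment complains about.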