[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140703#comment-14140703 ] Tim Smith commented on LUCENE-5940:
---
bq. Reindexing is part and parcel of search

I think the general goal should be that this is not the case, especially as search is adopted more and more as a replacement for systems that do not have these limitations/requirements (databases). Obviously this is an ambitious goal that can likely never be fully realized.

Reindexing also comes in two distinct flavors:
* Cold reindexing: rm -rf the index directory and re-feed.
** Requires 2x hardware or downtime.
* Live reindexing: change the configuration, restart the system, and re-feed all documents; the change is live once all documents have been reindexed.
** It is obviously a good idea to snapshot the previous index and configuration so you can restore them later on error.
** Minimal downtime (just a restart).
** Minimal search interruption (some queries related to the change may not match old documents until the reindex is complete).
** Old content can be replaced slowly over time to receive full functionality.

Live reindexing does have lots of pitfalls and may not always be viable. For instance, right now it is not possible to add offsets to an index using this approach: as soon as a new segment is merged with an old one, the offsets are blown away. I had filed a ticket for this; I'm not looking to reopen old wounds here, just pointing out an issue I had and worked around. Live reindexing is the goal I strive to achieve whenever reindexing is required (always with the caveat to back up your index first for safety). Some smart choices when designing the internal schema can reduce or eliminate many prospective issues here, even without any core changes to Lucene.

bq. it's strongly recommended that it be gathered into an intermediate store

These recommendations are always valid to make (and I will make them); however, this adds an entire new system to the mix, as well as new hardware, services, maintenance, security, etc. Also, given the scale and perhaps complexity of the documents, this may not even be enough, and it will still require a large amount of processing hardware to process those documents as fast as the index can index them in a reasonable amount of time (days versus months). In general, this is just extra complexity that will be dropped due to the higher price tag and maintenance cost. Then, when it finally is time to upgrade, the end-user expectation is "oh, we already have the data indexed, why can't we just use that with the new software?" This expectation is set by the fact that many customers/users are used to working with databases. I do not have this expectation myself; however, I have people downstream who do, and I need to do my best to accommodate them whether I like it or not.

Note: I'm not trying to force any requirements on the Lucene devs, or soliciting advice on specific functionality, just pointing out some real-world use cases I encounter related to the discussion here.

change index backwards compatibility policy.

Key: LUCENE-5940
URL: https://issues.apache.org/jira/browse/LUCENE-5940
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir

Currently, our index backwards compatibility is unmanageable. The length of time for which we must support old indexes is simply too long. Index back compat works like this: everyone wants it, but there are frequently bugs, and when push comes to shove, it's not a very sexy thing to work on or fix, so it's hard to get any help. Currently our back compat promise is just a broken promise, because we cannot actually guarantee it, for these reasons. I propose we scale back the length of time for which we must support old indexes.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
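The snapshot-before-live-reindex precaution mentioned above can be sketched with Lucene's SnapshotDeletionPolicy. This is a minimal sketch against the 5.x-style API, not code from the discussion; the index path is illustrative, and it assumes the index already has at least one commit (snapshot() throws IllegalStateException otherwise).

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.FSDirectory;

public class SnapshotBeforeReindex {
    public static void main(String[] args) throws Exception {
        // Wrap the normal deletion policy so a commit can be pinned.
        SnapshotDeletionPolicy snapshotter =
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setIndexDeletionPolicy(snapshotter);

        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
            // Pin the current commit; its files will not be deleted
            // while the live reindex rewrites documents on top of it.
            IndexCommit pinned = snapshotter.snapshot();
            try {
                // ... re-feed all documents here ...
            } finally {
                // Release the pin once the reindex is verified (or open
                // the pinned commit instead, to roll back on error).
                snapshotter.release(pinned);
                writer.deleteUnusedFiles();
            }
        }
    }
}
```

The same pattern is what Lucene's replication and backup utilities build on: as long as a commit is snapshotted, its files survive subsequent commits and merges.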
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131951#comment-14131951 ] Tim Smith commented on LUCENE-5940:
---
I fully understand the reasons for wanting to change the policy here. I absolutely hate maintaining backwards compat myself; it is a nightmare that leaves lots of rotting code lying around waiting to wreak havoc, and it makes it dicey to add new functionality. I'm fully on board with that sentiment. But I have to support back compat, and do so in a seamless, online manner that is not prone to user error.

I also get the feeling that a lot of the Lucene devs in general don't think full reindexing is an issue and assume it can be done at any point with minimal cost (just a vibe I've picked up). My experience is that it can be a many-months-long process (slow sources). This seems to influence support for backwards compatibility, as well as support for changing configuration/schema options for existing fields, etc.

By all means, create a good upgrade tool people can use. However, it won't be useful for me, and I will need to find a different solution (which will likely slow my adoption of 5.0 when it is released).

I am in no way advocating that 5.0 should support reading 3.x indexes. Again, I'm just adding my perspective here so informed people can make a decision based on all points of view. If the policy changes, I will just have to adapt as necessary.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130121#comment-14130121 ] Tim Smith commented on LUCENE-5940:
---
I understand the desire to change the policy here. I wish I didn't have to care about backwards compat support, but it's just the nature of things: people have large indexes that can take a significant amount of time to reindex (due to a slow source or complex processing).

The current proposal would be problematic for any Lucene users who do not release versions in lock step with Lucene versions. Solr would have limited issues here, since a user could just upgrade to Solr 4.99 (assuming 4.99 is the final 4.x version) and then to Solr 5.0, with no problems. However, if product X shipped with Lucene 4.88 and the last minor version in the 4.x line was 4.99, then the upgrade process to get to a Lucene 5.0 index becomes convoluted and requires the creation of custom offline tools to provide an upgrade path. The backwards compatibility requirement is then just shifted from the Lucene devs to the Lucene users, and the transition can no longer be seamless.

The current policy does not have these issues: all I need to do is fire up the next version, do a forceMerge, and everything is up to date on the latest codecs (no offline processes required, and search continues to work during the upgrade).
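The in-place upgrade path described above (fire up the next version, call forceMerge) amounts to the following. A minimal sketch against the 5.x-style API; the index path comes from the command line, and it assumes the new version still carries read support for the old segment formats.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class InPlaceUpgrade {
    public static void main(String[] args) throws Exception {
        // Open the existing (older-format) index with the new Lucene
        // version; the read-only codecs let the writer read old segments.
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Merging everything into one segment rewrites all segments
            // in the current index format; a DirectoryReader can keep
            // serving searches while this runs.
            writer.forceMerge(1);
            writer.commit();
        }
    }
}
```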
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130138#comment-14130138 ] Tim Smith commented on LUCENE-5940:
---
5.0 should not be saddled with supporting 3.x indexes; 100% agree there. However, 5.0 should ideally continue to support 4.0-4.99 indexes (at least from the codec/index-reading perspective).

The best place to handle backwards compat is in the core of Lucene. Otherwise, you are just going to have users all over the place doing their own interpretations of backwards compat, getting it wrong or broken, etc., which will subsequently result in lots of irate users filing tickets. If you only support the last minor version from the previous release, it makes things difficult for everyone who was not at that exact minor release.

Also, to Uwe's point, the IndexUpgrader tool is an offline process. In my situation, I would also need custom packaging of that tool to provide ease of use, proper codec usage, etc., versus just firing up the index on 5.0 and calling forceMerge. The custom packaging would also require including an old version of Lucene in my project, packaged separately, which would be a nightmare to maintain. Alternatively, I would just grab the source for all the removed 4.x codecs I need and pull them into my project (not ideal, since they would no longer be maintained by the Lucene devs and may have dependency issues that would require porting).
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130149#comment-14130149 ] Tim Smith commented on LUCENE-5940:
---
Time-based would be much more reasonable. As long as people are on a 4.x release that is less than 1-2 years old, they should be able to move directly to 5.0. Supporting indexes 4+ years old is asking a bit much, but assuming an external release cycle of one year, a 1-2 year cutoff is manageable.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130151#comment-14130151 ] Tim Smith commented on LUCENE-5940:
---
Firefox does not need to worry about an upgrade path for terabytes' worth of data; it only needs to worry about upgrading bookmarks, and that's about it.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130169#comment-14130169 ] Tim Smith commented on LUCENE-5940:
---
I fully understand the pain associated with maintaining back compat. I guess it would be good if you (and others) could enumerate all the issues involved, for full perspective (the description does not list them). Also, it should be on the developer who removes write support (or removes a codec) to add the backwards compat support/testing. Creating a new codec that supplants an old codec should not inherently require removing write support for the old codec.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130218#comment-14130218 ] Tim Smith commented on LUCENE-5940:
---
The problem with the upgrade-tool approach is that it doesn't scale to clusters with large numbers of indexes: for instance, a cluster with 50 indexes spread across a bunch of machines. The upgrade is now an involved manual task put in the hands of system administrators who don't really know what's going on under the hood. That's just asking for trouble.

It seems like the whole power of codecs is that you can avoid all this and allow seamless transitions by having read-only codecs for previous index formats. Are there technical issues here I'm unaware of, beyond creating and maintaining the backwards compat tests? Something outside the codec mechanism that causes problems? If not, just put the read-only codecs for old versions in a contrib module and let people upgrade at their leisure (and let the community find and fix bugs as they are encountered).
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130307#comment-14130307 ] Tim Smith commented on LUCENE-5940:
---
I would not consider old indexes lacking support for new features an issue. If you want to use new options/features/structures, you need to reindex; no problem there.

You don't have to convince me that supporting back compat sucks. I agree, but Lucene is used by a lot of people for a lot of disparate use cases. Removing support for back compat will drive people away, since it removes seamless upgrade paths. Think what would have happened if Microsoft had released 64-bit Windows with no support for running old 32-bit programs. People still want to run old DOS programs on Windows (go figure, but they want/need it). Dropping back compat hurts adoption of new versions; it just leaves a bunch of people running ancient versions of Lucene because they have no good upgrade path other than complete reindexing. If there is a bug in feature X, a possible solution is to just remove feature X, but that is going to anger everyone who relies on it, regardless of how much you may personally hate feature X.

The main challenge I see in what you mention is that you want (or new features may require) refactoring of the codec API. This is an engineering challenge, and it would just require some thought-out design to decide what final API refactors are needed to support flexibility, addition of new features, and growth, without requiring mucking with old codecs in the future. Right now, the IndexWriter and the codecs are pretty muddled together in some cases. Cleaning up these interfaces and making the codecs self-contained should be a goal for any refactors, to allow future innovation and addition of features.

As a Lucene user, if back compat is yanked and not provided in 5.0 for all 4.x indexes, I will be extremely resistant to upgrading. I would be more inclined to fork the latest 4.x and ditch 5.0; 5.0 would have to offer something REALLY compelling to get me to adopt it.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130314#comment-14130314 ] Tim Smith commented on LUCENE-5940:
---
bq. Can you elaborate more? Your example of 50 indexes spread across many machines doesn't make me understand how it would be difficult to run this tool. I see the steps as:

Here are the issues I would have with an upgrade-tool approach:
1. External network connectivity is not guaranteed.
2. I have special metadata written in the segment metadata that is important.
3. I use custom codec configuration that the upgrade tool would need to use.
4. Replicated indexes need a lot of care.
5. The tool would need to be run once for each directory containing an index, on every node that contains indexes. This is an ops nightmare, since I won't personally be running the tool; it leaves lots of room for user error that is avoided completely if the index upgrade is seamless (via read-only codecs for old versions).
6. Custom Directory implementations may muck up the works.

In general, I don't see any way this upgrade tool would be useful to me without repackaging it and adding a ton of extra code to do all the things I need to ensure a consistent index is emitted.
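The per-directory fan-out in point 5 is the crux: some wrapper like the following would have to be maintained and run by operations on every node. This is a hypothetical dry-run sketch, not a real tool; it only detects index directories by the presence of a segments.gen file (a heuristic) and prints the IndexUpgrader command lines rather than executing them. The jar name is illustrative, and a real run would also have to check that no writer holds each index lock.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Emit the IndexUpgrader invocation needed for every index directory
// found under a data root. A real script would execute each command
// instead of just printing it.
public class PrintUpgradeCommands {
    static List<String> upgradeCommands(Path dataRoot) throws IOException {
        List<String> cmds = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(dataRoot)) {
            paths.filter(p -> p.getFileName().toString().equals("segments.gen"))
                 .forEach(p -> cmds.add(
                     "java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader"
                     + " -delete-prior-commits " + p.getParent()));
        }
        return cmds;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PrintUpgradeCommands <dataRoot>");
            return;
        }
        for (String cmd : upgradeCommands(Paths.get(args[0]))) {
            System.out.println(cmd);
        }
    }
}
```

Even in this trivial form, the script must be distributed to every node, kept in sync with the deployed Lucene version, and run by someone who understands what a failure half-way through means, which is exactly the user-error surface the comment is describing.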
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130324#comment-14130324 ] Tim Smith commented on LUCENE-5940:
---
bq. Because you are not even considering the developer pain. The tests man, maintaining the tests.

The pain will continue to exist; you are just shifting who feels it. Again, I get how painful it is, but it is best to have that pain felt at the source (and handled properly and consistently by people who fully understand it), as opposed to pushing it all downstream and polluting the waters.
[jira] [Commented] (LUCENE-5940) change index backwards compatibility policy.
[ https://issues.apache.org/jira/browse/LUCENE-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130333#comment-14130333 ] Tim Smith commented on LUCENE-5940:
---
bq. I don't care what happens on this issue, personally. I'm done working on back compat completely until the policy changes. That includes the current in-progress 4.10.1 release. I've done more than my fair share of fighting it, and it just causes me endless frustration.

That is fully your prerogative; this is a volunteer community. I'm just putting in my two cents here, since a change will be really painful for me personally. Of course, I'm not a committer, so I have no final say.
[jira] [Commented] (LUCENE-5569) Rename AtomicReader to LeafReader
[ https://issues.apache.org/jira/browse/LUCENE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959096#comment-13959096 ] Tim Smith commented on LUCENE-5569:
---
-1. Please don't do this.

Renaming things for the sake of renaming them is a horrible burden on people using these APIs. For instance, every single minor version of Lucene 4.x has broken API signatures, resulting in hours or days of time spent reconciling the changes. Adding a major name change like this introduces significant noise on top of fixing any real compile errors and significantly complicates the porting process (it took me weeks to upgrade from Lucene 3.x to 4.x; I don't want to do that again). AtomicReader is a public API in Lucene and should not be renamed just because a new name seems better.

Rename AtomicReader to LeafReader

Key: LUCENE-5569
URL: https://issues.apache.org/jira/browse/LUCENE-5569
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Priority: Minor
Fix For: 5.0

See LUCENE-5527 for more context: several of us seem to prefer {{Leaf}} to {{Atomic}}. Talking from my experience, I was a bit confused in the beginning that this thing is named {{AtomicReader}}, since {{Atomic}} is otherwise used in Java in the context of concurrency. So maybe renaming it to {{Leaf}} would help remove this confusion and also carry the information that these readers are used as leaves of top-level readers?

--
This message was sent by Atlassian JIRA (v6.2#6252)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923910#comment-13923910 ] Tim Smith commented on LUCENE-5492:
---
Here's what my test is doing:
1. Unpack a Lucene 3.x-era index (it has one segment in it).
2. Open an IndexWriter on the 3.x index.
3. Open a DirectoryReader using the IndexWriter.
4. Add one new document.
5. Commit the IndexWriter.
6. Reopen the DirectoryReader using the IndexWriter.
7. Optimize the IndexWriter.
8. Commit the optimized index.
9. Reopen the DirectoryReader using the IndexWriter.

One thing of note is that I have a custom IndexDeletionPolicy. The policy holds onto named commit points: I hold onto the previous commit point at commit time, and then release it shortly after the commit is finished, once I have persisted my acceptance of the new commit point (calling deleteUnusedFiles() to purge it).

IndexFileDeleter AssertionError in presence of *_upgraded.si files

Key: LUCENE-5492
URL: https://issues.apache.org/jira/browse/LUCENE-5492
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 4.7
Reporter: Tim Smith
Assignee: Michael McCandless

When calling IndexWriter.deleteUnusedFiles against an index that contains 3.x segments, I am seeing the following exception:
{code}
java.lang.AssertionError: failAndDumpStackJunitStatment: RefCount is 0 pre-decrement for file _0_upgraded.si
 at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:630)
 at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:514)
 at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:286)
 at org.apache.lucene.index.IndexFileDeleter.revisitPolicy(IndexFileDeleter.java:393)
 at org.apache.lucene.index.IndexWriter.deleteUnusedFiles(IndexWriter.java:4617)
{code}
I believe this is caused by IndexFileDeleter not being aware of the Lucene3x segment infos format (notably the *_upgraded.si files created when upgrading an old index). This is new in 4.7 and did not occur in 4.6.1. Still trying to track down a workaround/fix.
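The deletion policy described above might look roughly like this. This is a hypothetical sketch of the pattern, not the actual code from the report; the class and method names are invented, and only the shape of the Lucene IndexDeletionPolicy callbacks is assumed.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Holds the previous commit point across a commit, releasing it only
// after the new commit has been accepted/persisted externally.
public class HoldPreviousCommitPolicy extends IndexDeletionPolicy {
    private volatile IndexCommit held; // pinned previous commit, if any

    @Override
    public void onInit(List<? extends IndexCommit> commits) throws IOException {
        onCommit(commits);
    }

    @Override
    public void onCommit(List<? extends IndexCommit> commits) throws IOException {
        // Commits are ordered oldest to newest; the last one is current.
        IndexCommit latest = commits.get(commits.size() - 1);
        // Pin the commit that was current before this one.
        if (commits.size() > 1) {
            held = commits.get(commits.size() - 2);
        }
        IndexCommit pinned = held;
        for (IndexCommit c : commits) {
            // Keep the latest commit and the pinned one; drop the rest.
            if (c != latest && c != pinned) {
                c.delete();
            }
        }
    }

    // Called once the new commit point has been persisted/accepted;
    // the caller then invokes IndexWriter.deleteUnusedFiles() to purge
    // the previously pinned commit.
    public void releaseHeld() {
        held = null;
    }
}
```

The interaction the bug report describes is the deleteUnusedFiles() call made after releaseHeld(): that is when IndexFileDeleter revisits the policy and decrements refcounts for the no-longer-pinned commit's files, including the *_upgraded.si files.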
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922646#comment-13922646 ] Tim Smith commented on LUCENE-5492:
---
Narrowing it down: I am definitely seeing a reference-count issue, and it only seems to occur when using the DirectoryReader.open(IndexWriter ...) methods. For one particular commit point, segments_4, I see the following refcount behavior:
* incRef segments_4
** incRef _0_upgraded.si, refcount=3
** decRef _0_upgraded.si, refcount=2
* incRef segments_4
** NOTE: _0_upgraded.si is not incRef'd this time
* ...
* delete segments_4
** decRef _0_upgraded.si -> ERROR
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922847#comment-13922847 ] Tim Smith commented on LUCENE-5492: ---
That seems to be the culprit. In my IndexWriter subclass, I overrode incRefDeleter and decRefDeleter to be no-ops and it no longer fails horribly. Hopefully this doesn't have any negative effects (it looks like that was all that was in the patch on LUCENE-5434, so worst case I just don't get to take advantage of the benefits there).
[jira] [Created] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
Tim Smith created LUCENE-5492:
-
Summary: IndexFileDeleter AssertionError in presence of *_upgraded.si files
Key: LUCENE-5492
URL: https://issues.apache.org/jira/browse/LUCENE-5492
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 4.7
Reporter: Tim Smith

When calling IndexWriter.deleteUnusedFiles against an index that contains 3.x segments, I am seeing the following exception:
{code}
java.lang.AssertionError: failAndDumpStackJunitStatment: RefCount is 0 pre-decrement for file _0_upgraded.si
at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:630)
at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:514)
at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:286)
at org.apache.lucene.index.IndexFileDeleter.revisitPolicy(IndexFileDeleter.java:393)
at org.apache.lucene.index.IndexWriter.deleteUnusedFiles(IndexWriter.java:4617)
{code}
I believe this is caused by IndexFileDeleter not being aware of the Lucene3x SegmentInfos format (notably the _upgraded.si files created to upgrade an old index). This is new in 4.7 and did not occur in 4.6.1. Still trying to track down a workaround/fix.
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921398#comment-13921398 ] Tim Smith commented on LUCENE-5492: ---
To the best of my knowledge, I don't think it's something crazy or wrong I'm doing on my part. Still trying to get to the bottom of it. It seems to be related to the accounting of files in a SegmentInfos not behaving properly for legacy 3.x segments.
[jira] [Comment Edited] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921432#comment-13921432 ] Tim Smith edited comment on LUCENE-5492 at 3/5/14 9:18 PM: ---
The following FileNotFound exception is firing:
{code}
java.io.FileNotFoundException: target/data-16000/mockEngine/index/_0.si (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:382)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:127)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:175)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:773)
{code}
This results in IndexFileDeleter ignoring the segment (skipping incRef()), leaving refcounts of 0 for its files. Then the CommitPoint is deleted (which does reference the files properly) and the files are decRef'd, resulting in the exception.
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921432#comment-13921432 ] Tim Smith commented on LUCENE-5492: ---
The following FileNotFound exception is firing:
{code}
java.io.FileNotFoundException: /home/tsmith/src/attivio/app/target/data-16000/mockEngine/index/_0.si (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:382)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:127)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:175)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:773)
{code}
This results in IndexFileDeleter ignoring the segment (skipping incRef()), leaving refcounts of 0 for said files. Then the CommitPoint is deleted (which does reference the files properly) and the files are decRef'd, resulting in the exception.
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921452#comment-13921452 ] Tim Smith commented on LUCENE-5492: ---
FileNotFound was actually triggered later (as things were shutting down, after the initial assertion tripped). My current theory is that the .si and _upgraded.si files are not being registered with the IndexFileDeleter properly, or are somehow double decRef'd. I see the _upgraded.si and .si files get decRef'd and deleted, followed by another decRef, which trips the assert.
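The failure mode discussed in this thread can be modeled without Lucene: a per-file reference counter where one code path skips incRef while a later path still decRefs. This is a hypothetical sketch of the accounting pattern, not IndexFileDeleter's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-file reference counting, illustrating how a skipped
// incRef (e.g. because reading a commit's SegmentInfos failed) leads to a
// "RefCount is 0 pre-decrement" error when that commit is later deleted.
public class RefCountDemo {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void incRef(String file) {
        refCounts.merge(file, 1, Integer::sum);
    }

    void decRef(String file) {
        int count = refCounts.getOrDefault(file, 0);
        if (count <= 0) {
            // Mirrors the assertion in the reported stack trace.
            throw new AssertionError("RefCount is 0 pre-decrement for file " + file);
        }
        refCounts.put(file, count - 1);
    }

    public static void main(String[] args) {
        RefCountDemo deleter = new RefCountDemo();
        // One commit point registers and releases its file normally.
        deleter.incRef("_0_upgraded.si");
        deleter.decRef("_0_upgraded.si"); // count back to 0
        // A second commit point fails to read its SegmentInfos, so incRef
        // is skipped; deleting that commit still decRefs its files:
        try {
            deleter.decRef("_0_upgraded.si");
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether the real bug is a skipped incRef or a double decRef, the observable symptom is the same: a decrement arriving when the count is already zero.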
[jira] [Created] (LUCENE-4671) CharsRef.subSequence broken
Tim Smith created LUCENE-4671:
-
Summary: CharsRef.subSequence broken
Key: LUCENE-4671
URL: https://issues.apache.org/jira/browse/LUCENE-4671
Project: Lucene - Core
Issue Type: Bug
Reporter: Tim Smith

Looks like CharsRef.subSequence() is currently broken. It is implemented as:
{code}
@Override
public CharSequence subSequence(int start, int end) {
  // NOTE: must do a real check here to meet the specs of CharSequence
  if (start < 0 || end > length || start > end) {
    throw new IndexOutOfBoundsException();
  }
  return new CharsRef(chars, offset + start, offset + end);
}
{code}
Since the CharsRef constructor is (char[] chars, int offset, int length), it should be:
{code}
@Override
public CharSequence subSequence(int start, int end) {
  // NOTE: must do a real check here to meet the specs of CharSequence
  if (start < 0 || end > length || start > end) {
    throw new IndexOutOfBoundsException();
  }
  return new CharsRef(chars, offset + start, end - start);
}
{code}
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
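The reported bug can be reproduced with a minimal stand-in for CharsRef (SimpleCharsRef below is a hypothetical simplification, not the real Lucene class): passing offset + end where the constructor expects a length yields a slice of the wrong size.

```java
// Minimal stand-in to illustrate the bug: the constructor takes
// (chars, offset, length), but the broken subSequence passes an end position
// as the third argument.
class SimpleCharsRef {
    final char[] chars;
    final int offset;
    final int length;

    SimpleCharsRef(char[] chars, int offset, int length) {
        this.chars = chars;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public String toString() {
        return new String(chars, offset, length);
    }

    // Broken: third argument is offset + end, an end position, not a length.
    SimpleCharsRef subSequenceBroken(int start, int end) {
        return new SimpleCharsRef(chars, offset + start, offset + end);
    }

    // Fixed: the third argument is the length of the slice.
    SimpleCharsRef subSequenceFixed(int start, int end) {
        return new SimpleCharsRef(chars, offset + start, end - start);
    }
}

public class CharsRefBugDemo {
    public static void main(String[] args) {
        // A ref viewing "world" inside "hello world": offset=6, length=5.
        SimpleCharsRef ref = new SimpleCharsRef("hello world".toCharArray(), 6, 5);
        System.out.println(ref.subSequenceFixed(1, 3));          // or
        // The broken version builds a ref of length offset + end = 9 instead
        // of 2, reading past the intended window (or past the array itself).
        System.out.println(ref.subSequenceBroken(1, 3).length);  // 9
    }
}
```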
[jira] [Commented] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548841#comment-13548841 ] Tim Smith commented on LUCENE-4671: ---
Looks like the index-out-of-bounds check is a bit off too (if someone ever uses non-zero offsets). The check should probably be:
{code}
if (start < offset || end > (offset + length) || start > end) {
  throw new IndexOutOfBoundsException();
}
{code}
Assignee: Robert Muir
Attachments: LUCENE-4671.patch
[jira] [Issue Comment Deleted] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4671: --
Comment: was deleted
(was: looks like the index-out-of-bounds check is a bit off too (if someone ever uses non-zero offsets). The check should probably be:
{code}
if (start < offset || end > (offset + length) || start > end) {
  throw new IndexOutOfBoundsException();
}
{code})
[jira] [Commented] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548843#comment-13548843 ] Tim Smith commented on LUCENE-4671: ---
Looks good.
[jira] [Commented] (LUCENE-4671) CharsRef.subSequence broken
[ https://issues.apache.org/jira/browse/LUCENE-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548848#comment-13548848 ] Tim Smith commented on LUCENE-4671: ---
It is; that's why I deleted the comment. It just looked wrong to me for a moment.
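Why the retracted check was off, and the original check right, can be verified with a quick sketch: per CharSequence.subSequence semantics, start and end are relative to the ref's own window, not absolute positions in the backing array. The helper methods below are illustrative only, not Lucene code:

```java
// Compare the two bounds checks from this thread for a ref viewing
// "world" inside "hello world" (offset=6, length=5). Because start/end are
// relative, subSequence(0, 5) selects the whole window and must be accepted.
public class BoundsCheckDemo {
    // Original check from CharsRef.subSequence: relative bounds.
    static boolean originalCheckRejects(int start, int end, int offset, int length) {
        return start < 0 || end > length || start > end;
    }

    // The offset-based check proposed (and then retracted) in this thread.
    static boolean proposedCheckRejects(int start, int end, int offset, int length) {
        return start < offset || end > (offset + length) || start > end;
    }

    public static void main(String[] args) {
        System.out.println(originalCheckRejects(0, 5, 6, 5)); // false: accepted
        System.out.println(proposedCheckRejects(0, 5, 6, 5)); // true: wrongly rejected
    }
}
```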
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537132#comment-13537132 ] Tim Smith commented on LUCENE-4560: ---
I found a 100% pure codec approach for providing all the functionality I require here and more, requiring no patches. If any committer has interest in pushing this ticket forward, I can clean up the patch/add suggestions, etc.; otherwise this ticket can be closed.

Support Filtering Segments During Merge
-
Key: LUCENE-4560
URL: https://issues.apache.org/jira/browse/LUCENE-4560
Project: Lucene - Core
Issue Type: Improvement
Reporter: Tim Smith
Attachments: LUCENE-4560.patch, LUCENE-4560-simple.patch

Spun off from LUCENE-4557. It is desirable to be able to filter segments during merge. Most often, a full reindex of content is not possible. Merging segments can sometimes have negative consequences when fields have different options (the most restrictive option is forced during merge). Being able to filter segments during merges will allow gradually migrating indexed data to new index settings, and support pruning/enhancing existing data gradually.
Use Cases:
* Migrate IndexOptions for fields (see LUCENE-4557)
* Gradually remove index fields no longer used
* Migrate indexed sort fields to DocValues
* Support converting data types for indexed data
* and so on
A patch will be forthcoming.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537223#comment-13537223 ] Tim Smith commented on LUCENE-4560: ---
The codec approach I'm taking is pretty specific, incorporating my schema/configuration to allow migrating/enhancing options/features/indexing formats/etc. (still exploring all the possibilities). There may be a few things that would reduce the overhead or ease the implementation; I will create new tickets with patches as I identify them. NOTE: the codec API is very nice. Congrats to all involved in making that happen.
[jira] [Commented] (LUCENE-4272) another idea for updatable fields
[ https://issues.apache.org/jira/browse/LUCENE-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537449#comment-13537449 ] Tim Smith commented on LUCENE-4272: ---
+1 on the term vector approach. I would like to see the following added to IndexableField:
{code}
/** Expert. index inverted terms for field */
public Terms invertedTerms();
{code}
This would allow partial updates via term vectors without having to flatten back into a TokenStream first. This would also facilitate things like the following:
* index a document into a memory index
* run alert queries/per-doc analysis against the memory index
* get terms from the memory index for all fields and index them into the on-disk index using IndexableField.invertedTerms()
* double tokenization/analysis/inversion is now avoided

another idea for updatable fields
-
Key: LUCENE-4272
URL: https://issues.apache.org/jira/browse/LUCENE-4272
Project: Lucene - Core
Issue Type: New Feature
Reporter: Robert Muir

I've been reviewing the ideas for updatable fields and have an alternative proposal that I think would address my biggest concern: not slowing down searching. When I look at what Solr and Elasticsearch do here, by basically reindexing from stored fields, I think they solve a lot of the problem: users don't have to rebuild their document from scratch just to update one tiny piece. But I think we can do this more efficiently: by avoiding reindexing of the unaffected fields. The basic idea is that we would require term vectors for this approach (as they already store a serialized indexed version of the doc), and so we could just take the other pieces from the existing vectors for the doc. I think we would have to extend vectors to also store the norm (so we don't recompute that), and payloads, but it seems feasible at a glance. I don't think we should discard the idea because vectors are slow/big today; this seems like something we could fix. Personally I like the idea of not slowing down search performance to solve the problem; I think we should really start from that angle and work towards making the indexing side more efficient, not vice-versa.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500226#comment-13500226 ] Tim Smith commented on LUCENE-4560: ---
The gradual approach is very much required. It's possible that a config change by a user will result in the need for a filtered reader on a merge. For instance, if you index a field without offsets, then shut down and start up with indexing of offsets: currently, this situation will result in newly indexed offsets being obliterated on merge (LUCENE-4557) with no possible way to save them. Especially in this case, the addIndexes() approach is way too costly just for a small configuration change. Small config changes shouldn't require the equivalent of a full optimize to take effect. Also, I argue that any addIndexes() approach is even more dangerous and just as prone to corruption. It can perform the same filtering of readers as the attached patch provides, but it modifies the entire index, thereby causing any corruption to be much more widespread. (Of course, either way it is up to the person implementing their custom filter to guarantee that no corruption occurs and that their code produces consistent indexes.) I will look into the MergePolicy approach. Offhand, it looks like this may still require a patch, as the SegmentMerger is currently only aware of SegmentReaders for merging; however, I may be able to add my own SegmentInfos to the OneMerge, replacing the codec with a wrapped codec that will apply my filtering. It'll be about a week before I can get back to testing this; I'll report back then.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500275#comment-13500275 ] Tim Smith commented on LUCENE-4560: --- Offsets can be used for highlighting. Users want to configure highlighting per field, but they don't always know which fields they want to highlight and may change this setting frequently. Setting highlighting=true on a field should be fully possible without requiring a full reindex (old documents of course will not be highlighted, or may default to a slower highlighting method that does not require offsets). Slowly refeeding old documents will let users get full functionality for old docs as well; however, refeeding may take weeks and should not impact indexing of new content. I can't proactively enable offsets on the off chance highlighting will be enabled in the future, as this implies additional disk requirements. This is the primary use case that spawned this ticket: right now, due to the merging behavior, I cannot use indexed offsets for highlighting, since a setting change will result in merges destroying offsets. This filtering merge reader approach also fulfills other requirements I have for migrating old indexed content to use new features, so it would be a win-win for me to use it to ensure consistency and conformance with my schema.
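The per-document decision described above (pick a highlighting strategy based on what the document's segment actually indexed) can be sketched without Lucene dependencies. This is an illustrative model only; `SegmentInfo` and `strategyFor` below are stand-ins, not Lucene's API:

```java
import java.util.Collections;
import java.util.Set;

public class HighlightStrategyDemo {
    // Illustrative stand-in: a segment records which fields were
    // indexed with offsets (real Lucene exposes this via FieldInfos).
    static final class SegmentInfo {
        final String name;
        final Set<String> fieldsWithOffsets;
        SegmentInfo(String name, Set<String> fieldsWithOffsets) {
            this.name = name;
            this.fieldsWithOffsets = fieldsWithOffsets;
        }
    }

    // Choose a highlighting strategy for one field of one document,
    // based on the segment the document lives in.
    static String strategyFor(SegmentInfo segment, String field) {
        return segment.fieldsWithOffsets.contains(field)
                ? "offsets"        // fast, offsets-based highlighting
                : "tokenstream";   // slower fallback: re-analyze stored text
    }

    public static void main(String[] args) {
        SegmentInfo oldSeg = new SegmentInfo("_0", Collections.emptySet());
        SegmentInfo newSeg = new SegmentInfo("_1", Collections.singleton("body"));
        System.out.println(strategyFor(oldSeg, "body")); // tokenstream
        System.out.println(strategyFor(newSeg, "body")); // offsets
    }
}
```

The key point is that the choice is per segment, not global, which is exactly what an unfiltered merge destroys.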
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500291#comment-13500291 ] Tim Smith commented on LUCENE-4560: --- bq. Its been this way since even 2.x, if you omitTF, then later decide you want TF and positions, you need to re-index. Re-index is the key word here. Re-indexing is not something that can always be done, and it implies a massive cost. Changing a schema setting for one field should not require a full re-index. I'm afraid I'm in a world where re-index is a four-letter word and should only be done in the most extreme of circumstances. My whole point here is that migration should be possible via a pluggable policy. bq. Thats why these are expert options. I know these are expert options, but there should also be a means to support migration to new settings (albeit an expert means to do so, one that may have some consequences for how old documents were indexed).
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500318#comment-13500318 ] Tim Smith commented on LUCENE-4560: --- A migration strategy does exist and is very simple. It is up to the implementer to determine how data will be migrated and to properly communicate that to the user base so expectations are set properly. All migrations have pros and cons, and may require gradual reindexing of content to ensure consistency for old documents, but this is up to the implementer and shouldn't be imposed by the Lucene APIs. Let's analyze the highlighting case based on indexed offsets. Assume documents were indexed with no offsets, and highlighting was being done for these documents using tokenstream-based highlighting over the stored field text. Now the user switches to more efficient offsets-based highlighting, and new documents are indexed with offsets. Right now, assuming no merging was done, it is very easy to see whether a document has indexed offsets, and documents can be highlighted on a per-document basis according to what was indexed. Then a merge happens. (Currently, this will force tokenstream-based highlighting for all documents, undoing the configuration setting.) If a migration policy is applied, old documents can have (0,0) offsets applied (this is the decision of the migration policy and is up to its implementer). Now, when highlighting is applied, if all positions for a document have a (0,0) offset, it can fall back to tokenstream-based highlighting; if positions have real offsets, it will use them to perform optimal, full-featured highlighting. This will result in slightly slower highlighting for old documents. The user experience can then be improved by doing a gradual reindex of old documents, without requiring the user to blast away their existing index.
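The (0,0)-placeholder fallback described above can be sketched in plain Java. This is an illustrative model of the decision logic, not Lucene's highlighter API; `Offset` and `useOffsetHighlighting` are names invented for the sketch:

```java
import java.util.Arrays;
import java.util.List;

public class OffsetFallbackDemo {
    // One (startOffset, endOffset) pair per indexed position.
    record Offset(int start, int end) {}

    // If every position carries the (0,0) placeholder written by the
    // migration policy, the document predates offsets: fall back to
    // tokenstream-based highlighting. Otherwise use the real offsets.
    static boolean useOffsetHighlighting(List<Offset> positions) {
        return positions.stream().anyMatch(o -> o.start() != 0 || o.end() != 0);
    }

    public static void main(String[] args) {
        List<Offset> migratedDoc = Arrays.asList(new Offset(0, 0), new Offset(0, 0));
        List<Offset> newDoc = Arrays.asList(new Offset(0, 5), new Offset(6, 11));
        System.out.println(useOffsetHighlighting(migratedDoc)); // false
        System.out.println(useOffsetHighlighting(newDoc));      // true
    }
}
```

Using (-1,-1) as the placeholder, as suggested later in this thread, would avoid any ambiguity with a legitimate zero-length token at offset 0.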
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500322#comment-13500322 ] Tim Smith commented on LUCENE-4560: --- Uwe, I plan to investigate your suggestions, and it may turn out that no additional patching of Lucene is required. It'll be about a week before I can get to that; I will post my results then. I still don't see the addIndexes() approach as viable, even as you suggest it, since it requires up-front migration steps instead of gradual migration. The merge policy approach you suggested will likely be more useful to me; however, it will be a nasty merge policy.
[jira] [Commented] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1346#comment-1346 ] Tim Smith commented on LUCENE-4560: --- My base requirement here is that this be an online process. As such, the addIndexes approach is really not useful as I see it, especially as it requires 2x disk space as well as completely new index directories; it does not play well with upgrading a user's existing index. What I see as needed is the ability to gradually migrate indexes such that any individual segment is itself consistent. Currently, merging of indexes can result in loss of indexed data or otherwise break consistency, as in LUCENE-4557. It is 100% OK if all segments have not been processed, as I can identify each segment's settings at index open/search time and optionally filter/search/read segments differently. It is true that once you start using this SegmentMergeFilter, you pretty much have to keep using it forever. I don't see this as an issue: when supporting old indexes, you constantly have to support migration of data that was indexed using old code. For instance, as time goes on, my MergeSegmentFilter will do more, supporting migration of more and more old index formats/config settings to the latest indexing format/settings. At quick glance, FilteringCodec looks like it applies to writing new content, not reading existing indexes? It doesn't seem like that would do the trick here. I would need some way to have the index writer wrap the codec for existing segments in order to inject my custom filtering during merging. That would be logically identical to the patch provided, but would potentially result in a much more complex patch.
[jira] [Updated] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4560: -- Lucene Fields: New,Patch Available (was: New)
[jira] [Created] (LUCENE-4560) Support Filtering Segments During Merge
Tim Smith created LUCENE-4560: - Summary: Support Filtering Segments During Merge Key: LUCENE-4560 URL: https://issues.apache.org/jira/browse/LUCENE-4560 Project: Lucene - Core Issue Type: Improvement Reporter: Tim Smith
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498079#comment-13498079 ] Tim Smith commented on LUCENE-4557: --- Spun off LUCENE-4560 for supporting filtering during segment merging. A patch will be forthcoming shortly. As long as that gains traction and makes it in, I will be happy (it will actually fulfill numerous other use cases I have). I still consider this issue a bug, given that indexed content is lost, and would recommend against closing this ticket; however, LUCENE-4560 will provide a more than adequate solution for my needs. Indexed Offsets Can Be Lost During Merge Key: LUCENE-4557 URL: https://issues.apache.org/jira/browse/LUCENE-4557 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0 Reporter: Tim Smith Attachments: OffsetsTest.java Primary use case: Start with a pre-4.0 index (no indexed offsets available). Start indexing new documents with indexed offsets (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, previously IndexOptions.DOCS_AND_FREQS_AND_POSITIONS). Merge/optimize the index. Newly indexed documents will now no longer have offsets available. In general, it is impossible to ever change a field to have offsets indexed when starting with an existing index, as a merge will cause offsets to be removed from the index. Desirable behavior would be for new documents to have offsets indexed properly, and old documents to have an offset of (0, 0) for all positions after merging with a segment that contains offsets. Current behavior can be very dangerous. For example: * Start indexing documents with indexed offsets * Change the config to not index offsets by accident * Index 1 document * Revert the config * Offsets will start disappearing from documents as segments are merged
[jira] [Updated] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4560: -- Attachment: LUCENE-4560.patch Attaching patch. The patch adds a MergeSegmentFilter base class and adds a config setter akin to IndexReaderWarmer on IndexWriterConfig (by all means, suggest better names). SegmentMerger will use this (if specified) to filter any segments being merged. A test case is included that uses the filter to remove an indexed field during merge.
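The shape of the proposed hook can be modeled without Lucene. The names `MergedSegmentFilter` and `dropField` below mirror the patch description and its test case, but the simplified `SegmentView` type and all signatures are stand-ins invented for this sketch, not the patch's actual API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeFilterDemo {
    // Stand-in for a segment: just field name -> indexed terms.
    interface SegmentView {
        Map<String, List<String>> fields();
    }

    // The proposed hook: wrap/transform each segment as it is merged.
    interface MergedSegmentFilter {
        SegmentView filter(SegmentView segment);
    }

    // A filter that drops a field no longer in the schema, analogous
    // to the patch's test case of removing an indexed field on merge.
    static MergedSegmentFilter dropField(String name) {
        return segment -> () -> {
            Map<String, List<String>> copy = new HashMap<>(segment.fields());
            copy.remove(name);
            return copy;
        };
    }

    public static void main(String[] args) {
        SegmentView seg = () -> Map.of(
                "title", List.of("lucene"),
                "obsolete", List.of("junk"));
        SegmentView filtered = dropField("obsolete").filter(seg);
        System.out.println(filtered.fields().containsKey("obsolete")); // false
    }
}
```

Because the filter only sees segments as they come up for merging, old data is rewritten gradually, which is the whole point of the ticket.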
[jira] [Comment Edited] (LUCENE-4560) Support Filtering Segments During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498087#comment-13498087 ] Tim Smith edited comment on LUCENE-4560 at 11/15/12 4:03 PM: - Attaching patch. The patch adds a MergedSegmentFilter base class and adds a config setter akin to IndexReaderWarmer on IndexWriterConfig (by all means, suggest better names). SegmentMerger will use this (if specified) to filter any segments being merged. A test case is included that uses the filter to remove an indexed field during merge. was (Author: tsmith): Attaching patch. The patch adds a MergeSegmentFilter base class and adds a config setter akin to IndexReaderWarmer on IndexWriterConfig (by all means, suggest better names). SegmentMerger will use this (if specified) to filter any segments being merged. A test case is included that uses the filter to remove an indexed field during merge.
[jira] [Created] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
Tim Smith created LUCENE-4557: - Summary: Indexed Offsets Can Be Lost During Merge Key: LUCENE-4557 URL: https://issues.apache.org/jira/browse/LUCENE-4557 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0 Reporter: Tim Smith
[jira] [Updated] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-4557: -- Attachment: OffsetsTest.java Attaching test that shows the issue
[jira] [Reopened] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith reopened LUCENE-4557: --- I disagree with that assessment. The problem here is not that offsets are not available on old docs; the problem is that offsets are destroyed on documents that had them set properly. This is very much a bug: a small temporary config mistake by a user can cause destruction of indexed data during merging, even after it is corrected. As far as I'm concerned, this issue makes it unfeasible to ever use indexed offsets, even though I very much want to. Reindexing data is quite often out of the question when large indexes are involved.
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497208#comment-13497208 ] Tim Smith commented on LUCENE-4557: --- I understand the similarity to the omitTF case, but would argue that too is a bug. The main issue here is with merging. Merging currently seems to choose the most restrictive IndexOptions for a field instead of the most general. When you are writing new segments and provide contradictory IndexOptions for the same field, it is OK for the writer to produce new segments with the most restrictive set (or throw an exception at that point); I have no argument there. However, when existing segments are merged, no indexed data should be lost (as it is in this case). If you have 2 segments with the following: Segment 1: docs and freqs and positions Segment 2: docs and freqs and positions and offsets then the merged segment should have the following: Merged: docs and freqs and positions and offsets The offsets for docs that were part of segment 1 should be null/(start=0, end=0), or better yet (-1, -1) if possible. The offsets for docs that were part of segment 2 should be the proper offsets that were indexed for segment 2 in the first place. The same rule could also be applied to the omitTF case: Segment 1: docs only Segment 2: docs and freqs and positions Merged: docs and freqs and positions Docs from segment 1 should have frequency 1 and a single position of 0.
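The "most general wins" rule described above reduces to taking a maximum over an ordered enum. The enum below is a stand-in that mirrors the ordering of Lucene 4.x's IndexOptions constants for illustration; `mostGeneral` is a hypothetical name, not a Lucene method:

```java
import java.util.List;

public class MergedIndexOptionsDemo {
    // Stand-in mirroring Lucene 4.x IndexOptions, declared from most
    // restrictive to most general, so ordinal order is "generality" order.
    enum IndexOptions {
        DOCS_ONLY,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    // The proposed rule: the merged segment keeps the MOST general
    // options seen across the input segments (the behavior complained
    // about keeps the most restrictive, discarding indexed data).
    static IndexOptions mostGeneral(List<IndexOptions> perSegment) {
        IndexOptions result = IndexOptions.DOCS_ONLY;
        for (IndexOptions o : perSegment) {
            if (o.compareTo(result) > 0) result = o;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mostGeneral(List.of(
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS,
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)));
        // DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }
}
```

Documents from less general segments would then need synthesized values (freq 1, position 0, offsets (0,0) or (-1,-1)) for the dimensions they never indexed, as the comment proposes.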
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497278#comment-13497278 ] Tim Smith commented on LUCENE-4557: --- I understand your aversion to what I suggest; however, I still argue this is a pretty nasty bug given that indexed content is lost. I also argue that it should be fully supported to change settings on fields as time goes on, especially the ability to make a field more general (add positions/offsets/insert-new-feature-here). Old data would of course be limited to the settings it was indexed with; however, new content should not be restricted to old settings. Without supporting this, you are forcing full reindexes in situations that really should not require them. This is a big red flag in my opinion. From what I understand of your FilterReader suggestion, it would require me to do the equivalent of an index optimize in order to upgrade/convert the index to have (0,0) offsets on segments that were lacking this setting? This seems extremely expensive: I would have to detect this situation at index startup time and then spend very large amounts of time performing the conversion, all while blocking indexing from continuing until the operation is over. Controlling this behavior at merge time seems to be the appropriate place. As long as I could control the merge behavior via a pluggable/configurable API I would be happy, and any other users who encounter this issue would also have a means to address it. It looks like merging of segment data is not exposed at all, so right now there is no way to handle this situation properly. For instance, if I could wrap the SegmentReader at merge time to provide null offsets, that would be fine. Ideally, there would still be some means to support efficient bulk merging of stored fields/term vectors/etc.
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497354#comment-13497354 ] Tim Smith commented on LUCENE-4557: --- I know you aren't changing your mind. I also disagree with calling this fake data; the data would be 100% representative of what was indexed. What I would at least like to see is a reasonable means to support this functionality. I propose some means to support more pluggable segment merging. For instance, if IndexWriter had the following method:

{code}
public AtomicReader getSegmentForMerge(SegmentReader reader) {
  return reader; // default implementation does nothing
}
{code}

then I could override this method, wrap the reader, and enhance its indexed content as it is merged in order to fulfill my requirements. This would have additional benefits, including but not limited to:
* Supporting migration of IndexOptions on fields
* Supporting migration of sort fields from indexed fields to DocValues
* Supporting conversion of data types for DocValues
* and so on

The wrapping would just need to be smart (a good MergeSegmentReader base class that SegmentMerger is integrated with) in order to preserve optimized bulk merges of stored fields/term vectors/etc. If this is a more palatable approach for you, I can work up a patch as I find time.
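The merge-time wrapping idea discussed here can be illustrated with a small decorator in plain Java. This is only a sketch of the pattern, not Lucene's API: the Postings interface and ZeroOffsetPostings class below are hypothetical stand-ins for a reader wrapper that substitutes (0, 0) offsets on segments that never indexed them.

```java
// A minimal stand-in for per-position postings data.
interface Postings {
    int position(int i);
    int startOffset(int i); // -1 means "offsets were not indexed"
    int count();
}

// Decorator that fills in a 0 offset for segments that lack offsets,
// so merged output is uniform even when old segments predate the setting.
class ZeroOffsetPostings implements Postings {
    private final Postings in;

    ZeroOffsetPostings(Postings in) { this.in = in; }

    public int position(int i) { return in.position(i); }
    public int count()         { return in.count(); }

    public int startOffset(int i) {
        int off = in.startOffset(i);
        return off < 0 ? 0 : off; // substitute 0 for missing offsets
    }
}
```

At merge time, a hook like the proposed getSegmentForMerge() would wrap old segments in such a decorator while passing new segments through untouched.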
[jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
[ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497459#comment-13497459 ] Tim Smith commented on LUCENE-4557: --- getSegmentForMerge could of course take an AtomicReader to support addIndexes as well. CheckIndex validates indexed positions/offsets against term vectors? Isn't that really slow? Also, if term vectors were indexed with offsets and the positions did not have offsets, and offsets are being added to positions as part of the merge, I could easily have my merge reader fill in the indexed position offsets from the term vectors. Of course this would be a slower merge, but it would then have 100% the right data and not result in the corruption you allude to. It would also make term vectors consistent and suitable for bulk merge. (Right now I don't have a use case that would have offsets indexed for both term vectors and positions (it'd be one or the other), but it's helpful that you pointed this issue out so I can make sure it would be handled properly in the future.) How about I look at working up a patch going down the pluggable segment data merging route, and we can iterate from there?
[jira] [Created] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
Tim Smith created LUCENE-4398: - Summary: Memory Leak in TermsHashPerField memory tracking Key: LUCENE-4398 URL: https://issues.apache.org/jira/browse/LUCENE-4398 Project: Lucene - Core Issue Type: Bug Affects Versions: 3.4 Reporter: Tim Smith

I am witnessing an apparent leak in the memory tracking used to determine when a flush is necessary. Over time, this will result in every single document being flushed into its own segment, as memUsage will remain above the configured buffer size, causing a flush to be triggered after every add/update. Best I can figure, this is being caused by TermsHashPerField's tracking of memory usage for postingsHash and/or postingsArray, combined with multi-threaded feeding. I suspect that a TermsHashPerField's postingsHash grows in one thread; then, when a segment is flushed, a single, different thread merges all TermsHashPerFields in FreqProxTermsWriter and calls shrinkHash(). I suspect this call of shrinkHash() is seeing an old postingsHash array and subsequently not releasing all the memory that was allocated. If this is the case, I am also concerned that FreqProxTermsWriter will not write the correct terms into the index, although I have not confirmed that any indexing problem occurs as of yet. NOTE: I am witnessing this growth in a test by subtracting the amount of memory allocated (but in a free state) by perDocAllocator/byteBlockAllocator/charBlocks/intBlocks from DocumentsWriter.memUsage.get() in IndexWriter.doAfterFlush(). I will see this stay at a stable point for a while; then on some flushes I will see it grow by a couple of bytes, and all subsequent flushes never go back down to the previous state. I will continue to investigate and post any additional findings.
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457151#comment-13457151 ] Tim Smith commented on LUCENE-4398: --- More information: I started tracking the memory usage internally in TermsHashPerField, using an internal AtomicLong that held the amount of memory retained by the class. I then added a finalize() method that dumped the memory held to stdout. Result: as soon as I witnessed the memory grow, I forced garbage collection (via the YourKit profiler). The finalize methods were then called, and the memory held by all garbage-collected TermsHashPerField instances equaled the amount of memory that was leaked. It looks like the DocumentsWriter is releasing thread states without freeing their bytesUsed(). NOTE: this puts my concerns about thread safety/improper indexing to rest.
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457196#comment-13457196 ] Tim Smith commented on LUCENE-4398: --- Looks like the culprit is DocFieldProcessorPerThread.trimFields(). This method releases fields that were not seen recently, and for each such field it leaks 16 bytes from DocumentsWriter.bytesUsed's memory accounting.
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457214#comment-13457214 ] Tim Smith commented on LUCENE-4398: --- Found an easy fix for this: commenting out the bytesUsed(postingsHashSize * RamUsageEstimator.NUM_BYTES_INT) line in TermsHashPerField's constructor does the trick. This results in not accounting for 16 bytes per field per thread, the same 16 bytes that were not being reclaimed by trimFields(). I suppose a more robust fix would be to add a destroy() method to the PerField interfaces that releases this memory (however, that would be a rather large patch). Also found a relatively easy way to reproduce this:
* Feed N documents with fields A-M
* Force flush
* Feed N documents with fields N-Z
* Force flush
* Repeat

It will take a long time to actually consume all the memory (using more fields in the test should accelerate things).
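The leak mechanism described in this thread can be simulated in a few lines of self-contained Java. Everything here (AccountingSim, PER_FIELD_BYTES, seeField, trimFields) is a hypothetical stand-in for the real classes; the point is just that bytes charged at per-field construction are never refunded when trimFields() discards the field, so the shared counter only ever ratchets upward.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Simulates the accounting bug: each per-field writer charges 16 bytes
// to a shared counter when created, but trimFields() drops the field
// object without refunding those bytes.
class AccountingSim {
    static final int PER_FIELD_BYTES = 16;
    final AtomicLong bytesUsed = new AtomicLong();
    final Map<String, Object> fields = new HashMap<>();

    void seeField(String name) {
        fields.computeIfAbsent(name, n -> {
            bytesUsed.addAndGet(PER_FIELD_BYTES); // charged at construction
            return new Object();
        });
    }

    // Buggy: releases the fields but never subtracts PER_FIELD_BYTES.
    void trimFields() { fields.clear(); }

    long leakedBytes() { return bytesUsed.get(); } // never returns to 0
}
```

Running the A-M / N-Z / repeat pattern from the repro steps against this simulation leaks 16 bytes per distinct field per cycle, exactly the growth pattern reported.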
[jira] [Commented] (LUCENE-4398) Memory Leak in TermsHashPerField memory tracking
[ https://issues.apache.org/jira/browse/LUCENE-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457244#comment-13457244 ] Tim Smith commented on LUCENE-4398: --- NOTE: the 16 bytes of unaccounted space for the postingsHash is actually much less than the object header and fields of a TermsHashPerField require, so I would argue that not accounting for these 16 bytes is a valid low-profile fix. The only gotcha would be if trimFields() is ever called on a TermsHashPerField that has not been shrunk down to size by a flush. Is that possible? Even if it is, I expect it only occurs in the rare case of deep-down exceptions. In that case, if abort() is called, I suppose the abort() method can be updated to shrink down the hash as well (if that is safe to do).
[jira] [Commented] (LUCENE-3373) waitForMerges deadlocks if background merge fails
[ https://issues.apache.org/jira/browse/LUCENE-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089674#comment-13089674 ] Tim Smith commented on LUCENE-3373: --- waitForMerges should continue to wait until all merges are complete (regardless of whether they all end up failing). I would suggest updating the MergeThread to catch all exceptions and allow processing of the next merge. Right now, any merge failure results in a ThreadDeath, which seems rather nasty; it should probably just catch the exception and log an index trace message.

waitForMerges deadlocks if background merge fails - Key: LUCENE-3373 URL: https://issues.apache.org/jira/browse/LUCENE-3373 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0.3 Reporter: Tim Smith

waitForMerges can deadlock if a merge fails under ConcurrentMergeScheduler. This is because the merge thread dies, but pending merges are still available. Normally, the merge thread will pick up the next merge once it finishes the previous one, but in the event of a merge exception the pending work is not resumed, and waitForMerges won't complete until all pending work is complete. I worked around this by overriding doMerge() like so:

{code}
protected final void doMerge(MergePolicy.OneMerge merge) throws IOException {
  try {
    super.doMerge(merge);
  } catch (Throwable exc) {
    // Just logging the exception and not rethrowing
    // insert logging code here
  }
}
{code}

Here are the rough steps I used to reproduce this issue. Override doMerge like so:

{code}
protected final void doMerge(MergePolicy.OneMerge merge) throws IOException {
  try { Thread.sleep(500L); } catch (InterruptedException e) { }
  super.doMerge(merge);
  throw new IOException("fail");
}
{code}

then do the following:
loop 50 times: addDocument // any doc
commit
waitForMerges // This will deadlock sometimes

SOLR-2017 may be related to this (the stack trace for that deadlock looked related).
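The doMerge() workaround above amounts to: swallow per-merge failures so the worker keeps draining the queue, and a waiter can never hang on leftover work. Here is a self-contained sketch of that pattern (SafeMergeWorker is a hypothetical class, not part of Lucene):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the workaround: catch Throwable per task so one failing
// "merge" cannot kill the worker while work is still queued.
class SafeMergeWorker {
    final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
    final AtomicInteger attempted = new AtomicInteger();

    void drain() {
        Runnable merge;
        while ((merge = pending.poll()) != null) {
            attempted.incrementAndGet();
            try {
                merge.run();
            } catch (Throwable t) {
                // log and continue with the next merge instead of
                // letting the worker thread die with work still queued
            }
        }
    }
}
```

Without the catch, the first throwing task would terminate drain() with entries still in the queue, which is the shape of the waitForMerges hang described in this issue.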
[jira] [Created] (LUCENE-3373) waitForMerges deadlocks if background merge fails
waitForMerges deadlocks if background merge fails - Key: LUCENE-3373 URL: https://issues.apache.org/jira/browse/LUCENE-3373 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0.3 Reporter: Tim Smith
[jira] Commented: (LUCENE-2658) TestIndexWriterExceptions random failure: AIOOBE in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913097#action_12913097 ] Tim Smith commented on LUCENE-2658: --- Is this related to (or the same as) LUCENE-2501?

TestIndexWriterExceptions random failure: AIOOBE in ByteBlockPool.allocSlice Key: LUCENE-2658 URL: https://issues.apache.org/jira/browse/LUCENE-2658 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.1, 4.0 Reporter: Robert Muir Assignee: Michael McCandless Attachments: LUCENE-2658.patch, LUCENE-2658_environment.patch

TestIndexWriterExceptions threw this today, and it's reproducible.
[jira] Commented: (LUCENE-2658) TestIndexWriterExceptions random failure: AIOOBE in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913140#action_12913140 ] Tim Smith commented on LUCENE-2658: --- Sadly, I haven't been able to gather the infostream for LUCENE-2501. There's a comment on LUCENE-2501 that seems to indicate the exception that started it all, though (CorruptIndexException: docs out of order (607 <= 607)).
[jira] Commented: (LUCENE-2276) Add IndexReader.document(int, Document, FieldSelector)
[ https://issues.apache.org/jira/browse/LUCENE-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888323#action_12888323 ] Tim Smith commented on LUCENE-2276: --- Instead of doing the following everywhere:

{code}
final Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException {
  return doc(n, null, fieldSelector);
}
{code}

you could do:

{code}
final Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException {
  return doc(n, new Document(), fieldSelector);
}
{code}

Then the interface for doc(int, Document, FieldSelector) can state that the document must not be null, and the "if null, new Document()" check later on can be skipped.

Add IndexReader.document(int, Document, FieldSelector) -- Key: LUCENE-2276 URL: https://issues.apache.org/jira/browse/LUCENE-2276 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Tim Smith Attachments: LUCENE-2276.patch

The Document object passed in would be populated with the fields identified by the FieldSelector for the specified internal document id. This method would allow reuse of Document objects when retrieving stored fields from the index.
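The overload contract being proposed can be sketched in plain Java. StoredFieldsReader and its doc() methods below are hypothetical stand-ins, not Lucene's IndexReader API; they just show how the convenience overload supplies the container itself, so the reuse variant can require a non-null argument and skip the null check:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two-overload pattern from the comment above.
class StoredFieldsReader {
    // Convenience overload: always passes a fresh container.
    List<String> doc(int n) {
        return doc(n, new ArrayList<String>());
    }

    // Reuse variant: the contract says 'reuse' must be non-null,
    // so there is no "if null, new Document()" branch here.
    List<String> doc(int n, List<String> reuse) {
        reuse.clear();
        reuse.add("field-" + n); // stand-in for loading real stored fields
        return reuse;
    }
}
```

A caller that fetches many documents can allocate one container up front and pass it to the reuse variant on every call, avoiding a fresh allocation per document.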
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881675#action_12881675 ] Tim Smith commented on LUCENE-2501: --- I've been informed that this exception is still happening. However, whenever index tracing is turned on, it never seems to occur (the extra logging seems to be preventing some lower-level synchronization issue from surfacing).

ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice -- Key: LUCENE-2501 URL: https://issues.apache.org/jira/browse/LUCENE-2501 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1 Reporter: Tim Smith

I'm seeing the following exception during indexing:

{code}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 14
  at org.apache.lucene.index.ByteBlockPool.allocSlice(ByteBlockPool.java:118)
  at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:490)
  at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:511)
  at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:104)
  at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:120)
  at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:468)
  at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
  at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
  at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:774)
  at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:757)
  at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2085)
  ... 37 more
{code}

This seems to be caused by the following code:

{code}
final int level = slice[upto] & 15;
final int newLevel = nextLevelArray[level];
final int newSize = levelSizeArray[newLevel];
{code}

This can result in level being a value between 0 and 14, while the array nextLevelArray is only of size 10. I suspect the solution would be either to cap the level at 10 or to add more entries to nextLevelArray so it has 15 entries. However, I don't know if something more is going wrong here and this is just where the exception surfaces from a deeper issue.
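The out-of-bounds condition described above is easy to reproduce in isolation. In the sketch below, the table contents mirror the nextLevelArray values the comment refers to, while the byte value fed in is a made-up example of a corrupted slice marker:

```java
// Demonstrates the mismatch: (slice[upto] & 15) can yield 0..14, but a
// lookup table with only 10 entries throws AIOOBE for any level >= 10.
class AllocSliceDemo {
    // Same contents as Lucene's nextLevelArray (10 entries).
    static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

    static int nextLevel(byte sliceEnd) {
        int level = sliceEnd & 15;      // low nibble: any value 0..14
        return NEXT_LEVEL_ARRAY[level]; // AIOOBE when level >= 10
    }
}
```

A valid slice end byte keeps the low nibble in 0..9, so the lookup succeeds; the exception in the stack trace (index 14) implies the byte's low nibble held a value that should never appear in a healthy pool, consistent with the suspicion of a deeper corruption.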
[jira] Created: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice -- Key: LUCENE-2501 URL: https://issues.apache.org/jira/browse/LUCENE-2501 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1 Reporter: Tim Smith
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879382#action_12879382 ] Tim Smith commented on LUCENE-2501: --- That's what I was afraid of. I got this report second hand, so I don't have access to the data that was being ingested, and I currently don't know enough about this section of the indexing code to guess in order to create a unit test. I'll try to create a test, but I expect it will be difficult (especially if no one else has ever seen this).
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879403#action_12879403 ] Tim Smith commented on LUCENE-2501: --- Here's all the info I have available right now (will try to get more):
* 16-core, 18-gig RAM Windows 7 machine
* 1 JVM
* 16 index writers (each using default settings: 64M RAM, etc.)
* 300+ docs/sec ingestion (small documents)
* commit every 10 minutes
* optimize every hour

The report I got indicated that every now and then one of these ArrayIndexOutOfBounds exceptions would occur. This would result in the document being indexed failing, but otherwise things would continue normally.
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879422#action_12879422 ] Tim Smith commented on LUCENE-2501: --- Some more info:
* ingestion is being performed in multiple threads
* the ArrayIndexOutOfBounds exceptions are occurring in bursts

I suspect that these bursts of exceptions stop after the next commit (at which point the buffers are all reset). NOTE: I have not yet confirmed this, but I suspect it.
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879483#action_12879483 ] Tim Smith commented on LUCENE-2501: --- Looks like this may be the original source of the errors:

{code}
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (607 <= 607 )
	at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
	at org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
	at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
	at org.apache.lucene.index.TermsHash.flush(TermsHash.java:144)
	at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
	at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
	at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:583)
	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3602)
	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3511)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3502)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2103)
{code}
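The failure above violates the invariant that postings for a term are flushed with strictly increasing doc IDs: 607 compared against the previous 607 means the same doc ID was appended twice, consistent with concurrent corruption of the in-memory postings rather than bad input. A simplified stand-in for the ordering check (not the actual Lucene class; the exception type is swapped so the sketch stays self-contained):

```java
// Simplified stand-in for the docID-ordering check in FormatPostingsDocsWriter.
class PostingsOrderCheck {
    private int lastDocID = -1;

    void addDoc(int docID) {
        if (docID <= lastDocID) {
            // Lucene throws CorruptIndexException here with this wording
            throw new IllegalStateException(
                "docs out of order (" + docID + " <= " + lastDocID + " )");
        }
        lastDocID = docID;
    }
}
```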
[jira] Commented: (LUCENE-2501) ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice
[ https://issues.apache.org/jira/browse/LUCENE-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879489#action_12879489 ] Tim Smith commented on LUCENE-2501: --- Will do. It may take some time before it occurs again. Also, if this boils down to a synchronization error of some sort, the extra file I/O done to write the trace info to disk may add some implicit synchronization/slowdown that may result in not being able to reproduce the issue. (I've seen this occur on non-Lucene-related synchronization issues: add the extra debug logging and it never fails anymore.)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857375#action_12857375 ] Tim Smith commented on LUCENE-2324: --- bq. But... could we allow an add/updateDocument call to express this affinity, explicitly?

I would love to be able to explicitly define a segment affinity for documents I'm feeding. This would then allow me to say:
* all docs from table a have affinity 1
* all docs from table b have affinity 2

This would ideally result in indexing documents from each table into a different segment. (Obviously, I would then need to be able to have segment merging be affinity aware, so optimize/merging would only merge segments that share an affinity.)

Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 Attachments: lucene-2324.patch, LUCENE-2324.patch

See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU. -- This message is automatically generated by JIRA. 
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857385#action_12857385 ] Tim Smith commented on LUCENE-2324: --- bq. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices?

That's what I'm currently thinking I'll have to do. However, it would be ideal if I could either subclass IndexWriter or use IndexWriter directly with this affinity concept (potentially writing my own segment merger that is affinity aware). That makes it so I can easily use near-real-time indexing, as only one IndexWriter will be in the mix, as well as make managing deletes and a whole host of other issues with multiple indexes disappear. It also makes it so I can configure memory settings across all affinity groups instead of having to dynamically create separate indexes, each with their own memory bounds.
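The separate-indices workaround discussed here amounts to routing each document to a per-affinity writer, so segments (and merges) never cross affinity groups. A hypothetical sketch of that routing, generic over the writer type since the actual IndexWriter wiring is deployment-specific (all names here are illustrative, not Lucene API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical router: lazily creates one writer per affinity group, so each
// group's documents land in, and merge within, their own index.
class AffinityRouter<W> {
    private final Map<Integer, W> writers = new HashMap<>();
    private final IntFunction<W> writerFactory;

    AffinityRouter(IntFunction<W> writerFactory) {
        this.writerFactory = writerFactory;
    }

    // Returns the writer that all documents with this affinity should go to.
    W writerFor(int affinity) {
        return writers.computeIfAbsent(affinity, writerFactory::apply);
    }

    int groupCount() {
        return writers.size();
    }
}
```

This buys the segregation but not what the comment asks for: a single IndexWriter with shared memory accounting, NRT reopen, and unified delete handling; each routed writer still carries its own RAM budget.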
[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders
[ https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851388#action_12851388 ] Tim Smith commented on LUCENE-2071: --- +1

I have a special subclassed IndexSearcher that certain special queries require, so IndexWriter's delete-by-query will fail as an IndexSearcher is passed in. With this added method, I would be able to construct my own Searcher over the readers and then apply deletes properly. This would also allow counting the deletes as they occur (which is commonly desired when deleting by query).

It would be nice if this method also worked with non-pooled readers, so my desired method signature would be:

{code}
void updateReaders(Readers callback, boolean pooled)
{code}

If the readers were already pooled, this would have no effect; otherwise it would just open the segment readers, just like the non-pooled delete readers are opened.

Allow updating of IndexWriter SegmentReaders Key: LUCENE-2071 URL: https://issues.apache.org/jira/browse/LUCENE-2071 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.1 Attachments: LUCENE-2071.patch

This discussion kind of started in LUCENE-2047. Basically, we'll allow users to perform delete document, and norms updates on SegmentReaders that are handled by IndexWriter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders
[ https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851528#action_12851528 ] Tim Smith commented on LUCENE-2071: --- Found a couple of small issues with the patch attached to this ticket:

1. applyDeletes issue: saw this was in another ticket. I think the flush should be flush(true, true, false), and applyDeletes() should be called in the synchronized block.
2. IndexWriter.changeCount not updated: the call() method does not return a boolean indicating if there were any changes that would need to be committed. As a result, if no other changes are made to the IndexWriter, the commit will be skipped, even though deletes/norm updates were sent in. IndexReader.reopen() will then return the old reader without the deletes/norms.
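The second issue comes down to a callback that cannot report whether it changed anything, so IndexWriter's change counter stays put and the next commit is a no-op. A hedged sketch of a shape that avoids the skipped commit, with illustrative names (this is not the patch's actual API):

```java
// Hypothetical callback that reports whether it made changes, letting the
// writer bump its change counter only when a commit is actually needed.
interface ReaderUpdate<R> {
    boolean call(R readers);  // true => deletes/norm updates were applied
}

class WriterSketch<R> {
    long changeCount = 0;

    void updateReaders(ReaderUpdate<R> callback, R readers) {
        if (callback.call(readers)) {
            changeCount++;  // without this, a later commit() would be skipped
        }
    }
}
```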
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850127#action_12850127 ] Tim Smith commented on LUCENE-2345: --- bq. I think we should only commit this only on 3.1 (new feature)?

3.1 only, of course (just posted a 3.0 patch now as that's what I'm using and I need the functionality now).

bq. Tim, do you think the plugin model (extension by composition) would be workable for your use case? Ie, instead of a factory enabling subclasses of SegmentReader?

As long as the plugin model allows the same capabilities, that could work just fine and could be the final solution for this ticket. I mainly need the ability to add data structures to a SegmentReader that will be shared by all SegmentReader instances for a segment, and then add some extra meta information on a per-instance basis. Is there a ticket or wiki page that details the plugin architecture/design so I could take a look?

However, would the plugins allow overriding specific IndexReader methods? I still see the need to override specific methods on a SegmentReader (in order to track statistics or provide changed/different/faster/more feature-rich implementations). I don't have a direct need for this right now, but I could envision needing it in the future.

Here are a few requirements I would pose for the plugin model (maybe they are already thought of):
* Plugins have hooks to reopen themselves (some plugins can be shared across all instances of a SegmentReader)
** These reopen hooks would be called during SegmentReader.reopen()
* Plugins are initialized during SegmentReader.get()/SegmentReader.reopen()
** Plugins should not have to be added after the fact, as this would not allow proper warming/initializing of plugins inside NRT indexing
** I assume this would need to be added as some list of PluginFactories passed to IndexWriter/IndexReader.open()?
* Plugins should have a close method that is called in SegmentReader.close()
** This will allow proper release of any resources
* Plugins are passed an instance of the SegmentReader they are for
** Plugins should be able to access all methods on a SegmentReader
** This would effectively allow overriding a SegmentReader by having a plugin provide the functionality instead (however, only people explicitly calling the plugin would get this benefit)

Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch

I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* to attach custom objects to an instance of a segment reader (caches, statistics, so on and so forth)
* to override methods on SegmentReader as needed

Currently this isn't really possible. I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:

{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return newSegmentReader(readOnly);
  }
}
{code}

It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc). I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in future.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
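The lifecycle requirements listed above (initialize during open/reopen, close with the reader, access to the owning reader) can be distilled into a small contract. This is a sketch of what such a SegmentPlugin interface might look like, not an API from any attached patch; the RecordingPlugin simply demonstrates the expected call order:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract distilled from the requirements above.
interface SegmentPlugin {
    void init(Object segmentReader);   // would be called from SegmentReader.get()/reopen()
    void close();                      // would be called from SegmentReader.close()
}

// Trivial implementation that records lifecycle events for inspection.
class RecordingPlugin implements SegmentPlugin {
    final List<String> events = new ArrayList<>();

    public void init(Object segmentReader) {
        events.add("init");
    }

    public void close() {
        events.add("close");
    }
}
```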
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2345: -- Attachment: LUCENE-2345_3.0.plugins.patch

Here's a patch (again, against 3.0) showing the minimal API I would like to see from the plugin model.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850323#action_12850323 ] Tim Smith commented on LUCENE-2345: --- Found one issue with the plugins patch: with NRT indexing, if the SegmentReader is opened with no TermInfosReader (for merging), then the plugins will be initialized with a SegmentReader that has no ability to walk the TermsEnum. I guess SegmentPlugin initialization should wait until after the terms index is loaded, or another method for catching this event should be added to the SegmentPlugin interface.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850361#action_12850361 ] Tim Smith commented on LUCENE-2345: ---
bq. My patch removes loadTermsIndex method from SegmentReader and requires you to reopen it.
that's definitely much cleaner and would solve the issue in my current patch (sadly i'm on 3.0 and want to keep my patch there at a minimum until i can port to all the goodness on 3.1).
bq. Also, they extend not only SegmentReader, but the whole hierarchy - SR, MR, DR, whatever.
i just wussed out and did only the SegmentReader case as that's all i need right now
bq. as all the hooks are on the factory classes
could you post your factory class interface? If i base my 3.0 patch off that i can reduce my 3.1 port overhead. are there any tickets tracking your reopen refactors or your plugin model? If not, feel free to retool this ticket for your plugin model for IndexReaders as that will solve my use cases (and then some)

Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch

I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* attach custom objects to an instance of a segment reader (caches, statistics, and so forth)
* override methods on SegmentReader as needed

currently this isn't really possible. I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:
{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return get(readOnly);
  }
}
{code}
It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc). I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in future. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
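The factory proposed above can be sketched stand-alone. Everything below is a hypothetical stub for illustration (these are not the real Lucene SegmentReader classes); CountingReaderFactory shows the kind of subclass hook the ticket asks for, attaching per-reader bookkeeping without touching core code.

```java
// Stand-in stubs for illustration only; not the real Lucene classes.
class SegmentReader {
    protected void init() { }          // hook called after construction
    boolean isReadOnly() { return false; }
}

class ReadOnlySegmentReader extends SegmentReader {
    @Override boolean isReadOnly() { return true; }
}

// The proposed factory: override get()/reopen() to return your own subclass.
class SegmentReaderFactory {
    public SegmentReader get(boolean readOnly) {
        SegmentReader r = readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
        r.init();                      // let subclasses do extra initialization
        return r;
    }
    public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
        return get(readOnly);          // default: just build a fresh reader
    }
}

// Example subclass that captures open events (caches, statistics, ...).
class CountingReaderFactory extends SegmentReaderFactory {
    int opened = 0;
    @Override public SegmentReader get(boolean readOnly) {
        opened++;
        return super.get(readOnly);
    }
}
```

Because reopen() dispatches back through get(), a subclass only has to override one method to intercept every reader the factory hands out.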
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2345: -- Attachment: LUCENE-2345_3.0.patch
Here's a patch against 3.0 that provides the SegmentReaderFactory ability (not tested yet, but i'll be doing that shortly as i integrate this functionality)
* It adds a SegmentReaderFactory; IndexWriter now has a getter and setter for it
* SegmentReader has a new protected method init(), which is called after the segment reader has been initialized (to allow subclasses to hook this action and do additional initialization, etc)
* added 2 new IndexReader.open() methods that allow specifying the SegmentReaderFactory
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849731#action_12849731 ] Tim Smith commented on LUCENE-2345: --- that was my plan
[jira] Created: (LUCENE-2345) Make it possible to subclass SegmentReader
Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849358#action_12849358 ] Tim Smith commented on LUCENE-1821: --- This would actually be solved by LUCENE-2345 for me, as i would then be able to tag SegmentReaders with any additional accounting information i would need

Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-1821.patch

Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id for the documents it matches (only the relative doc offset into the segment). If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid.

Suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub weights must pass this offset down to created sub weights.

Details on the workaround: In order to work around this, you must do the following:
* Subclass IndexSearcher
* Add an int getIndexReaderBase(IndexReader) method to your subclass
* during Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass)
* during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
* the Scorer can now rebase any collected docids using this offset

Example implementation of getIndexReaderBase():
{code}
// NOTE: a more efficient implementation can cache the result of
// gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getReader()) {
    return 0;
  } else {
    List readers = new ArrayList();
    gatherSubReaders(readers);
    Iterator iter = readers.iterator();
    int maxDoc = 0;
    while (iter.hasNext()) {
      IndexReader r = (IndexReader) iter.next();
      if (r == reader) {
        return maxDoc;
      }
      maxDoc += r.maxDoc();
    }
  }
  return -1; // reader not in searcher
}
{code}
Notes:
* This workaround makes it so you cannot serialize your custom Weight implementation
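The offset bookkeeping in the workaround above reduces to a prefix sum over per-segment maxDoc counts. A minimal self-contained sketch of just that arithmetic (DocBase and baseFor are illustrative names, not Lucene API):

```java
import java.util.List;

// Doc-base bookkeeping: the absolute doc base of a segment is the sum of
// maxDoc over all segments that come before it in reader order.
class DocBase {
    static int baseFor(List<Integer> maxDocs, int target) {
        int base = 0;
        for (int i = 0; i < maxDocs.size(); i++) {
            if (i == target) return base;
            base += maxDocs.get(i);
        }
        return -1; // segment not found
    }
}
```

A scorer would then rebase a segment-relative hit as absoluteDoc = base + relativeDoc, which is exactly what the getIndexReaderBase() workaround computes by walking the sub-readers.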
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849455#action_12849455 ] Tim Smith commented on LUCENE-2345: --- that's the reassurance i needed :) will start working on a patch tomorrow. will take a few days, as i'll start with a 3.0 patch (which i use), then will create a 3.1 patch once i've got that all fleshed out
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849497#action_12849497 ] Tim Smith commented on LUCENE-2345: --- i'll do my initial work on 3.0 so i can absorb the changes now, and will post that patch. at which point, i can wait for you to finish whatever you need, or we can just incorporate the same ability into your patch for the other ticket. i would just like to see the ability to subclass SegmentReaders in 3.1 so i don't have to port a patch when i absorb 3.1 (just use the finalized apis)
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844930#action_12844930 ] Tim Smith commented on LUCENE-2310: --- Personally, i like keeping Fieldable (or having AbstractField just with abstract methods and no actual implementation). for feeding documents, i use custom Fieldable implementations to reduce the number of setters called, as Fields of different types have different constant settings

Reduce Fieldable, AbstractField and Field complexity -- Key: LUCENE-2310 URL: https://issues.apache.org/jira/browse/LUCENE-2310 Project: Lucene - Java Issue Type: Sub-task Components: Index Reporter: Chris Male Attachments: LUCENE-2310-Deprecate-AbstractField.patch

In order to move field-type-like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality than storing fields, most of which is being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possibly Fieldable), moving much of the functionality into Field and FieldType.
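The "custom Fieldable implementations to reduce setters" idea above can be sketched without any Lucene types: bake the constant settings into a subclass constructor so feeding code never touches a setter. SimpleField and KeywordField here are hypothetical stand-ins, not the real Fieldable/AbstractField/Field classes.

```java
// Illustrative stand-in for a field with per-type constant settings.
class SimpleField {
    final String name, value;
    final boolean stored, indexed, tokenized;
    SimpleField(String name, String value,
                boolean stored, boolean indexed, boolean tokenized) {
        this.name = name; this.value = value;
        this.stored = stored; this.indexed = indexed; this.tokenized = tokenized;
    }
}

// A custom subclass bakes its constant settings in, so document-feeding
// code constructs it with no setter calls at all.
class KeywordField extends SimpleField {
    KeywordField(String name, String value) {
        super(name, value, /*stored=*/true, /*indexed=*/true, /*tokenized=*/false);
    }
}
```

One subclass per field type keeps the settings in a single place and makes the feeding loop a plain sequence of constructor calls.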
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839682#action_12839682 ] Tim Smith commented on LUCENE-2283: --- i haven't been able to fully replicate this issue in a unit test scenario, however this will definitely resolve the 40M of ram that was allocated and never released for the RAMFiles on the StoredFieldsWriter (keeping that bound to the configured memory size)

Possible Memory Leak in StoredFieldsWriter -- Key: LUCENE-2283 URL: https://issues.apache.org/jira/browse/LUCENE-2283 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4.1 Reporter: Tim Smith Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2283.patch, LUCENE-2283.patch, LUCENE-2283.patch

StoredFieldsWriter creates a pool of PerDoc instances. this pool will grow but never be reclaimed by any mechanism. furthermore, each PerDoc instance contains a RAMFile. this RAMFile will also never be truncated (and will only ever grow) (as far as i can tell). When feeding documents with a large number of stored fields (or one large dominating stored field) this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare. Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of RAMFiles that are cached) etc
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838976#action_12838976 ] Tim Smith commented on LUCENE-2283: --- I'll work up another patch. might take me a few minutes to get my head wrapped around the TermVectorsTermsWriter stuff
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2283: -- Attachment: LUCENE-2283.patch Here's a new patch with your suggestions
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-2283: -- Attachment: LUCENE-2283.patch Here's a patch for using a pool for stored fields buffers
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837793#action_12837793 ] Tim Smith commented on LUCENE-2283: --- I came across this issue looking for a reported memory leak during indexing. a yourkit snapshot showed that the PerDocs for an IndexWriter were using ~40M of memory (at which point i came across this potentially unbounded memory use in StoredFieldsWriter).

this snapshot seems more or less at a stable point (memory grows but then returns to a normal state), however i have reports that eventually the memory is completely exhausted, resulting in out of memory errors. I so far have not found any other major culprit in the lucene indexing code. This index receives a routine mix of very large and very small documents (which would explain this situation). The VM and system have a more than ample amount of memory given the buffer size and what should be normal indexing RAM requirements.

Also, a major difference between this leak not occurring and it showing up is that previously the IndexWriter was closed when performing commits; now the IndexWriter remains open (just calling IndexWriter.commit()). So, if any memory is leaking during indexing, it is no longer being reclaimed during commit. As a side note, closing the index writer at commit time would sometimes fail, resulting in some following updates failing because the index writer was locked and couldn't be reopened until the old index writer was garbage collected, so i don't want to go back to this for commits.

Its possible there is a leak somewhere else (i currently do not have a snapshot from right before out of memory issues occur, so currently the only thing that stands out is the PerDoc memory use).

As far as a fix goes, wouldn't it be better to have the RAMFiles used for stored fields pull and return byte buffers from the byte block pool on the DocumentsWriter? This would allow the memory to be reclaimed based on the index writer's buffer size (otherwise there is no configurable way to tune this memory use)
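The single-pool fix proposed above (stored-field buffers drawn from a shared, budget-bounded block pool instead of per-PerDoc RAMFiles) can be sketched in isolation. Everything here is illustrative (the class name, block size, and recycle policy are assumptions, not DocumentsWriter internals); it only shows the shape of the idea: allocation is unbounded, but retention is capped by the configured budget.

```java
import java.util.ArrayDeque;

// A size-bounded byte-block pool: blocks are handed out freely, but only
// a budget's worth are kept on return; the rest are dropped for the GC
// to reclaim (the "balanceRAM" behavior described in the thread).
class ByteBlockPool {
    static final int BLOCK_SIZE = 32 * 1024;       // illustrative block size
    private final int maxPooledBlocks;              // derived from the RAM budget
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    int allocated = 0;                              // blocks currently handed out

    ByteBlockPool(long ramBudgetBytes) {
        this.maxPooledBlocks = (int) (ramBudgetBytes / BLOCK_SIZE);
    }

    synchronized byte[] getBlock() {
        allocated++;
        byte[] b = free.poll();                     // reuse a pooled block if any
        return b != null ? b : new byte[BLOCK_SIZE];
    }

    synchronized void recycle(byte[] block) {
        allocated--;
        if (free.size() < maxPooledBlocks) free.push(block);  // keep within budget
    }

    synchronized int pooledBlocks() { return free.size(); }
}
```

With all stored-field buffers drawn from one such pool, retained memory is bounded by the writer's configured buffer size rather than growing with the largest document ever seen.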
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837821#action_12837821 ] Tim Smith commented on LUCENE-2283: --- ramBufferSizeMB is 64MB. Here's the yourkit breakdown per class:
* DocumentsWriter - 256 MB
** TermsHash - 38.7 MB
** StoredFieldsWriter - 37.5 MB
** DocumentsWriterThreadState - 36.2 MB
** DocumentsWriterThreadState - 34.6 MB
** DocumentsWriterThreadState - 33.8 MB
** DocumentsWriterThreadState - 27.5 MB
** DocumentsWriterThreadState - 13.4 MB

I'm starting to dig into the ThreadStates now to see if anything stands out here.
bq. Hmm, that makes me nervous, because I think in this case the use should be bounded.
I should be getting a new profile dump at crash time soon, so hopefully that will make things clearer
bq. That doesn't sound good! Can you post some details on this (eg an exception)?
If i recall correctly, the exception was caused by an out of disk space situation (which would recover). obviously not much that can be done about this other than adding more disk space; the situation would recover, but docs would be lost in the interim
bq. But, anyway, keeping the same IW open and just calling commit is (should be) fine.
Yeah, this should be the way to go, especially as it results in the pooled buffers not needing to be reallocated/reclaimed/etc. however, right now this is the only change i can currently think of that could result in memory issues.
bq. Yes, that's a great solution - a single pool. But that's a somewhat bigger change.
Seems like this would be the best approach, as it makes the memory bounded by the configuration of the engine, giving better reuse of byte blocks and a better ability to reclaim memory (in DocumentsWriter.balanceRAM())
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837875#action_12837875 ] Tim Smith commented on LUCENE-2283: ---
bq. I agree. I'll mull over how to do it... unless you're planning on consing up a patch
I'd love to, but don't have the free cycles at the moment :(
bq. How many threads do you pass through IW?
honestly i don't 100% know about the origin of the threads i'm given. In general, they should be from a static pool, but may be dynamically allocated if the static pool runs out. One thought i had recently was to control this more tightly by having a limited number of static threads that called IndexWriter methods, in case that was the issue (but that would be a pretty big change)
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837881#action_12837881 ] Tim Smith commented on LUCENE-2283: --- latest profile dump has pointed out a non-lucene issue as causing some memory growth, so feel free to drop down priority. however it seems like using the bytepool for the stored fields would be good overall
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837919#action_12837919 ] Tim Smith commented on LUCENE-2283: --- another note is that this was on a 64 bit vm. i've noticed that all the memsize calculations assume 4 byte pointers, so perhaps that can lead to more memory being used than would otherwise be expected (although 256 MB is still well over the 2X mem use that would potentially be expected in that case)
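The 4-byte-pointer point above can be made concrete with a back-of-envelope calculation: an estimator that charges 4 bytes per object reference undercounts by up to 2x on a 64-bit VM using uncompressed 8-byte references. The class and numbers below are purely illustrative and ignore object headers and alignment.

```java
// Back-of-envelope reference-size arithmetic: bytes consumed by an array
// of object references, at a given per-reference size (4 on 32-bit or
// with compressed oops, 8 on 64-bit without them).
class RefSizeEstimate {
    static long refArrayBytes(long numRefs, int refSizeBytes) {
        return numRefs * refSizeBytes;   // ignores headers/alignment for simplicity
    }
}
```

So an accounting model built on 4-byte pointers explains at most a 2x discrepancy on references, which is why the 256 MB observation above still exceeds what pointer width alone could account for.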
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838017#action_12838017 ] Tim Smith commented on LUCENE-2283:
---
i'm working up a patch for the shared byte-block pool for stored field buffers (found a few cycles).
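A shared block pool along these lines might look like the following sketch. This is hypothetical (not the patch being worked up); the block size, pool cap, and class name are made-up illustration values:

```java
import java.util.ArrayDeque;

// Hypothetical shared byte-block pool: writers borrow fixed-size blocks and
// return them, so buffer memory is bounded and recycled instead of each
// per-doc buffer growing without limit.
class ByteBlockPool {
    static final int BLOCK_SIZE = 32 * 1024;
    static final int MAX_POOLED_BLOCKS = 64;
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    // Hand out a recycled block if one is available, else allocate fresh.
    byte[] borrow() {
        byte[] b = free.poll();
        return b != null ? b : new byte[BLOCK_SIZE];
    }

    // Return a block to the pool; cap the pool so it cannot grow unbounded.
    void release(byte[] block) {
        if (free.size() < MAX_POOLED_BLOCKS) {
            free.push(block);
        } // else drop the block and let GC reclaim it
    }

    int pooled() { return free.size(); }
}
```

Because every block is the same fixed size, one huge document costs more blocks transiently but never inflates the retained footprint of any single pooled buffer.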
[jira] Created: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
Possible Memory Leak in StoredFieldsWriter
--
Key: LUCENE-2283
URL: https://issues.apache.org/jira/browse/LUCENE-2283
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
[jira] Created: (LUCENE-2276) Add IndexReader.document(int, Document, FieldSelector)
Add IndexReader.document(int, Document, FieldSelector)
--
Key: LUCENE-2276
URL: https://issues.apache.org/jira/browse/LUCENE-2276
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Tim Smith

The Document object passed in would be populated with the fields identified by the FieldSelector for the specified internal document id. This method would allow reuse of Document objects when retrieving stored fields from the index.
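The reuse pattern this wish enables can be sketched generically. The `Doc` class and `loadInto` method below are stand-ins for the proposed `IndexReader.document(int, Document, FieldSelector)`, which does not exist in Lucene; only the allocation pattern is the point:

```java
import java.util.ArrayList;
import java.util.List;

// Generic sketch of the reuse pattern: one container object is cleared and
// repopulated per document instead of being reallocated per call.
class ReuseSketch {
    static final class Doc {
        final List<String> fields = new ArrayList<>();
        void clear() { fields.clear(); }
    }

    // Stand-in for the proposed document(int, Document, FieldSelector):
    // fills 'target' in place for the given doc id, no new Document created.
    static void loadInto(int id, Doc target) {
        target.clear();
        target.fields.add("field-for-doc-" + id);
    }

    static void scan(int maxDoc) {
        Doc reusable = new Doc();          // allocated once, outside the loop
        for (int id = 0; id < maxDoc; id++) {
            loadInto(id, reusable);        // no per-doc Document allocation
        }
    }
}
```

Without such a method, retrieving stored fields for N documents allocates N Document objects (plus their field lists); with it, a hot loop produces essentially no garbage.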
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790803#action_12790803 ] Tim Smith commented on LUCENE-1923:
---
added getName() in case anyone is currently relying on the current (default) output from toString() on index readers. feel free to rename the getName() methods to toString().

Add toString() or getName() method to IndexReader
--
Key: LUCENE-1923
URL: https://issues.apache.org/jira/browse/LUCENE-1923
Project: Lucene - Java
Issue Type: Wish
Components: Index
Reporter: Tim Smith
Assignee: Michael McCandless
Attachments: LUCENE-1923.patch

It would be very useful for debugging if IndexReader either had a getName() method, or a toString() implementation that would get a string identification for the reader. for SegmentReader, this would return the same as getSegmentName(). for Directory readers, this would return the generation id? for MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...). right now, i have to check instanceof for SegmentReader, then call getSegmentName(), and for all other IndexReader types, i would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point i would have to recursively walk the sub readers and try again). I could work up a patch if others like this idea.
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787472#action_12787472 ] Tim Smith commented on LUCENE-1923:
---
i won't have the time till after the new year. if someone else wants to work up a patch, go for it (this seems simple enough and adds some nice info capabilities for logging/etc); otherwise, i'll get to it when i can.
[jira] Updated: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1923:
--
Attachment: LUCENE-1923.patch

Here's a simple patch to get the ball rolling. This adds a getName() method to IndexReader. The default implementation will be: SimpleClassName(subreader.getName(), subreader.getName(), ...)
SegmentReader will return the same value as getSegmentName().
DirectoryReader will return: DirectoryReader(segment_N, segment.getName(), segment.getName(), ...)
ParallelReader will return: ParallelReader(parallelReader1.getName(), parallelReader2.getName(), ...)
This patch does not currently have toString() return getName(). Do with this patch as you will.
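The naming scheme described in the patch can be sketched with simplified stand-ins (these classes are illustrative, not the actual patch code or the real IndexReader hierarchy):

```java
// Hypothetical sketch of the naming scheme: leaf readers report their segment
// name, composite readers compose their label with their sub-readers' names.
class ReaderNames {
    interface Named { String getName(); }

    // Stand-in for SegmentReader: getName() is just the segment name.
    static final class Segment implements Named {
        private final String name;
        Segment(String name) { this.name = name; }
        public String getName() { return name; }
    }

    // Stand-in for DirectoryReader/MultiReader/ParallelReader:
    // Label(sub1, sub2, ...) built recursively from sub-readers.
    static final class Composite implements Named {
        private final String label;
        private final Named[] subs;
        Composite(String label, Named... subs) { this.label = label; this.subs = subs; }
        public String getName() {
            StringBuilder sb = new StringBuilder(label).append('(');
            for (int i = 0; i < subs.length; i++) {
                if (i > 0) sb.append(", ");
                sb.append(subs[i].getName());
            }
            return sb.append(')').toString();
        }
    }
}
```

Nested composites fall out for free: a Composite over Composites recursively produces names like `ParallelReader(DirectoryReader(_0, _1), _2)`, which is exactly the debugging breadcrumb the ticket asks for.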
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786921#action_12786921 ] Tim Smith commented on LUCENE-1859:
---
close if you like. application writers can add guards for this if they like/need to as a custom TokenFilter. mainly created this ticket as this can result in an unbounded buffer should people use the token stream api incorrectly (or against the suggestions of lucene core developers).

TermAttributeImpl's buffer will never shrink if it grows too big
--
Key: LUCENE-1859
URL: https://issues.apache.org/jira/browse/LUCENE-1859
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

This was also an issue with Token previously as well. If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory. Obviously, it can be argued that Tokenizers should never emit large tokens, however it seems that TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set. I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario). perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.
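The shrink logic suggested in the last sentence could look roughly like this. It is a hypothetical sketch, not the real TermAttributeImpl (the growth policy is simplified to exact-size allocation, and the MAX_BUFFER_SIZE value is made up):

```java
// Sketch of the suggested growTermBuffer change: shrink an oversized buffer
// back to MAX_BUFFER_SIZE once a token that fits within it arrives.
class ShrinkingTermBuffer {
    static final int MAX_BUFFER_SIZE = 16 * 1024;
    private char[] termBuffer = new char[32];

    char[] resizeBuffer(int needed) {
        if (needed > termBuffer.length) {
            // grow (simplified: exact size rather than an oversize policy)
            termBuffer = new char[needed];
        } else if (termBuffer.length > MAX_BUFFER_SIZE && needed <= MAX_BUFFER_SIZE) {
            // the suggested addition: reclaim memory left by a rare huge token
            termBuffer = new char[MAX_BUFFER_SIZE];
        }
        return termBuffer;
    }

    int capacity() { return termBuffer.length; }
}
```

One rare multi-megabyte token inflates the buffer only until the next normal-sized token, so the per-thread worst case is bounded by MAX_BUFFER_SIZE rather than the largest token ever seen.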
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781615#action_12781615 ] Tim Smith commented on LUCENE-2086:
---
Got some performance numbers. Description of test (NOTE: this is representative of actions that may occur in a running system, not a contrived test):
* feed 4 million operations (3/4 are deletes, 1/4 are updates (single field))
* commit
* feed 1 million operations (about 1/3 are updates, 2/3 deletes (randomly selected))
* commit

Numbers:
|| Desc || Old || New ||
| feed 4 million | 56914ms | 15698ms |
| commit 4 million | 9072ms | 14291ms |
| total (4 million) | 65986ms | 29989ms |
| update 1 million | 46096ms | 11340ms |
| commit 1 million | 13501ms | 9273ms |
| total (1 million) | 59597ms | 20613ms |

This shows significant improvements with the patched code (1/3 the time for the 1 million run, about 1/2 the time for the initial 4 million feed). This means i'm definitely going to need to incorporate this patch while i'm still on 3.0 (will upgrade to 3.0 as soon as it's out, then apply this fix). Ideally, a 3.0.1 would be forthcoming in the next month or so with this fix so i wouldn't have to maintain this patched overlay of code.

When resolving deletes, IW should resolve in term sort order
--
Key: LUCENE-2086
URL: https://issues.apache.org/jira/browse/LUCENE-2086
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2086.patch

See the java-dev thread IndexWriter.updateDocument performance improvement.
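A quick sanity check of the speedups implied by the totals in the table above (the helper class is just for the arithmetic):

```java
// Compute the old/new speedup ratios from the reported totals.
class Speedup {
    static double ratio(long oldMs, long newMs) {
        return (double) oldMs / newMs;
    }
}
```

`ratio(65986, 29989)` is about 2.2x for the 4 million feed and `ratio(59597, 20613)` about 2.9x for the 1 million run, matching the "about 1/2 the time" and "1/3 the time" claims.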
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780698#action_12780698 ] Tim Smith commented on LUCENE-2086:
---
any chance this can go into 3.0.0 or a 3.0.1?
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780701#action_12780701 ] Tim Smith commented on LUCENE-2086:
---
i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner rather than later (what's the timeframe for 3.1?). otherwise i'll have to override the classes involved and pull in this patch (never liked this approach myself).
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780710#action_12780710 ] Tim Smith commented on LUCENE-2086:
---
bq. maybe try it and report back?
i'll see if i can find some cycles to try this against the most painful use case i have.
bq. I'd rather see us release a 3.1 sooner rather than later, instead.
yes please. I would definitely like to see a more accelerated release cycle (even if less functionality gets into each minor release).
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777008#action_12777008 ] Tim Smith commented on LUCENE-1909:
---
I have the following use case: i have a configuration bean; this bean can be customized via xml at config time. in this bean, i expose the setting for the terms index divisor, so my bean has to have a default value for this. right now, i just use 1 for the default value. it would be nice if i could use the lucene constant instead of a hard-coded 1, as the lucene constant could change in the future (not really likely, but it's one less constant i have to maintain). if the default is not made public, i have 2 options:
# use a hard-coded constant in my code for the default value (doing this right now)
# use an Integer object, and have null be the default
the nasty part about the second option is that i now have to do conditional opening of the reader depending on whether the value is null (unset), when it would be much simpler (and easier for me to maintain) if i could just always pass in the value.

Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
---
Key: LUCENE-1909
URL: https://issues.apache.org/jira/browse/LUCENE-1909
Project: Lucene - Java
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
Fix For: 3.0
Attachments: LUCENE_1909.patch
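The two options can be sketched as one hypothetical bean (the bean and its local constant are illustrative; the constant's value 1 mirrors the current default, standing in for a public IndexReader.DEFAULT_TERMS_INDEX_DIVISOR):

```java
// Hypothetical configuration bean illustrating both options from the comment.
class ReaderConfig {
    // Option 1: a local hard-coded default (what the comment describes doing
    // today). With a public Lucene constant, this line would reference it
    // directly instead of duplicating the value 1.
    static final int DEFAULT_TERMS_INDEX_DIVISOR = 1;
    private int termsIndexDivisor = DEFAULT_TERMS_INDEX_DIVISOR;

    void setTermsIndexDivisor(int divisor) { this.termsIndexDivisor = divisor; }
    int getTermsIndexDivisor() { return termsIndexDivisor; }

    // Option 2: a nullable Integer where null means "unset, let Lucene pick".
    // This forces conditional reader-opening logic at every use site.
    private Integer maybeDivisor;
    boolean isDivisorSet() { return maybeDivisor != null; }
}
```

With option 1 the open path always passes a concrete divisor; with option 2 every open must branch on `isDivisorSet()`, which is the conditional-opening awkwardness the comment complains about.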