Re: Welcome Stefan Vodita as Lucene committer

2024-01-18 Thread Shai Erera
Welcome Stefan!

On Thu, Jan 18, 2024, 18:21 Dawid Weiss  wrote:

>
> Welcome, Stefan!
> Dawid
>
> On Thu, Jan 18, 2024 at 4:54 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Team,
>>
>> I'm pleased to announce that Stefan Vodita has accepted the Lucene PMC's
>> invitation to become a committer!
>>
>> Stefan, the tradition is that new committers introduce themselves with a
>> brief bio.
>>
>> Congratulations, welcome, and thank you for all your improvements to
>> Lucene and our community,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>


Re: Index ordinal data in the taxonomy

2023-05-13 Thread Shai Erera
Hi

> There are two approaches we could take initially,

Both approaches look fine to me, as long as we expose the right API. I
assume that if we use updatable DV, then we'll have a proper API on
TaxoWriter to update the fields, but otherwise (if we only allow updating
during a taxonomy rebuild) we won't have any update API. Another option is to
allow these rewrites during taxonomy merges, something we can think about.

> Yes, we've considered things like a local database or a separate index.

Another approach is to treat this like a rescore query: you aggregate the
facets without their signals and then rescore the top-K (100, 1,000, 10,000)
facets according to external signals. Just another idea to think about
(yes, it's not perfect, but it might work OK-ish?)
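
To make the idea concrete, here is a minimal sketch of such a facet rescore,
assuming the external signals are available in-memory as a hypothetical
label-to-score map:

import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.LabelAndValue;

public class FacetRescorer {
  /**
   * Re-ranks the top-K labels of a plain facet aggregation by an external
   * signal. "signals" is a hypothetical lookup from facet label to score.
   */
  public static LabelAndValue[] rescore(
      FacetResult result, Map<String, Double> signals, int topK) {
    return Arrays.stream(result.labelValues)
        .limit(topK) // keep only the top-K facets of the cheap first pass
        .sorted(
            Comparator.comparingDouble(
                    (LabelAndValue lv) -> signals.getOrDefault(lv.label, 0.0))
                .reversed()) // re-order them by the external signal
        .toArray(LabelAndValue[]::new);
  }
}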

Shai

On Sat, May 13, 2023 at 6:45 PM Stefan Vodita 
wrote:

> Hello Shai,
>
> Thank you for the feedback! I'll try to answer each of the questions.
>
> > will it change the API in a non-backward-compatible way, or impact faceted
> search performance for the common case?
>
> The new API could overload FacetsConfig.build or provide a new method in
> TaxonomyWriter to plug in ordinal data. It doesn't have to change the
> functionality that already exists. A taxonomy index in the common case
> would be
> indistinguishable before and after this change.
>
> > Do you intend to support arbitrary signals, or only numeric ones?
>
> This is a crucial question. I'd like to take one small step forward and
> leave
> room for us to make improvements later. There are two approaches we could
> take initially, which I think you've already identified in your email:
>
> 1. Allow only updatable DocValues as ordinal data. This could become
> limiting at some point, but maybe it's a good first solution.
>
> 2. Disallow updating ordinal data. New ordinal data can only come in when
> a new
> taxonomy gets built.
>
> For the Amazon product search use case, option 2 is slightly better. We
> would
> build new indexes more often than we would get ordinal data updates. But
> I'm
> not sure what the better option is in the general case. This is where I'd
> like
> feedback from other users. Maybe there's also some other approach I haven't
> thought of.
>
> > Have you considered an alternative implementation of pulling that info
> from another source during retrieval?
>
> Yes, we've considered things like a local database or a separate index.
> I haven't done a performance test, but my guess is that having the ordinal
> data in the taxonomy is as fast as it gets for use-cases like the faceting
> aggregation example in my previous email. Even if that isn't the case, the
> taxonomy solution is more convenient and less burdensome from an
> operational
> standpoint.
>
>
> I hope that's useful. Thanks again for the feedback,
>
> Stefan
>
> On Thu, 11 May 2023 at 16:53, Shai Erera  wrote:
> >
> > Hi Stefan,
> >
> > This sounds interesting and useful. It's like static scores for Lucene
> documents, only that we will apply them to ordinals. Since I assume it's
> not a very common use case though, do you know if this new functionality
> affects existing use cases? For example, will it change the API in a
> non-backward-compatible way, or impact faceted search performance for the
> common case?
> >
> > Do you intend to support arbitrary signals, or only numeric ones?
> Numeric signals will allow you to efficiently update the taxonomy index's
> ordinal documents without updating the documents themselves (which will
> change their ordinal!!). Other signals don't support this sort of update
> (yet), so you might run into the issue of not being able to update them.
> And at least for the author-citation-signal, that's definitely something
> you'll want to update (unless you rebuild the index from time to time, when
> the signals are updated).
> >
> > Have you considered an alternative implementation of pulling that info
> from another source during retrieval? Just curious what would be the
> performance implications, since an alternative source can give you the
> flexibility of supporting other signals which are more complicated to
> update, but won't affect the taxonomy index.
> >
> > Generally though, I don't see a reason not to support it.
> >
> > Shai
> >
> > On Thu, May 11, 2023 at 1:03 PM Stefan Vodita 
> wrote:
> >>
> >> Hi everyone,
> >>
> >> I work on the Lucene product search team at Amazon. We’ve been
> considering
> >> indexing scoring signals for ordinals into the taxonomy, which could
> reduce
> >> index size for some use-cases.
> >>
> >> Example
> >>
> >> Let's consider a library of research papers ...

Re: Index ordinal data in the taxonomy

2023-05-11 Thread Shai Erera
Hi Stefan,

This sounds interesting and useful. It's like static scores for Lucene
documents, only that we will apply them to ordinals. Since I assume it's
not a very common use case though, do you know if this new functionality
affects existing use cases? For example, will it change the API in a
non-backward-compatible way, or impact faceted search performance for the
common case?

Do you intend to support arbitrary signals, or only numeric ones? Numeric
signals will allow you to efficiently update the taxonomy index's ordinal
documents without updating the documents themselves (which will change
their ordinal!!). Other signals don't support this sort of update (yet), so
you might run into the issue of not being able to update them. And at least
for the author-citation-signal, that's definitely something you'll want to
update (unless you rebuild the index from time to time, when the signals
are updated).
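
For numeric signals, such an in-place update could be a one-liner against the
taxonomy's IndexWriter. A minimal sketch, assuming a hypothetical "citations"
numeric doc-values field on the ordinal documents and a hypothetical "label"
term identifying them:

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class OrdinalSignalUpdater {
  /**
   * Updates the per-ordinal "citations" signal in place. Because
   * updateNumericDocValue only rewrites doc-values and never deletes and
   * re-adds the document, the author's ordinal is preserved.
   */
  static void updateCitations(IndexWriter taxoWriter, String authorLabel, long citations)
      throws IOException {
    taxoWriter.updateNumericDocValue(new Term("label", authorLabel), "citations", citations);
  }
}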

Have you considered an alternative implementation of pulling that info from
another source during retrieval? Just curious what would be the performance
implications, since an alternative source can give you the flexibility of
supporting other signals which are more complicated to update, but won't
affect the taxonomy index.

Generally though, I don't see a reason not to support it.

Shai

On Thu, May 11, 2023 at 1:03 PM Stefan Vodita 
wrote:

> Hi everyone,
>
> I work on the Lucene product search team at Amazon. We’ve been considering
> indexing scoring signals for ordinals into the taxonomy, which could reduce
> index size for some use-cases.
>
> Example
>
> Let's consider a library of research papers, where each paper is
> represented by
> a Lucene document and the paper's author is a facet field in that
> document. For
> each author we store the total number of citations. We want to compute a
> measure of each author's impact, the total number of citations divided by
> the number of articles published.
>
> Implementation
>
> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
> currently support storing data about an ordinal, but the taxonomy is
> itself a
> Lucene index, where each ordinal is represented by a document. Right now,
> the
> ordinal document has only a few fields allowing it to model the taxonomy
> structure, but we could conceivably add arbitrary fields to the ordinal
> documents. We would index the total number of citations an author has as a
> DocValue in the corresponding ordinal document.
>
> Advantages
>
> The alternative would be to denormalize data about the authors and have it
> on
> each doc that references that author. This leads to duplication. Since
> Lucene
> already has a document representation of the author (the ordinal doc), it
> makes sense conceptually that data about the author should be associated
> with the ordinal doc.
>
>
> I'm curious if anyone else has tried something like this and if the
> approach
> seems reasonable. I’ve made an attempt to code it and I can open a PR if
> this
> sounds like a useful feature.
>
> Stefan
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Shai Erera
Putting ChatGPT aside, what are the implications of (1) removing the limit,
(2) increasing the limit, or (3) making it configurable at the app's
discretion? The configuration could even take the form of a VectorEncoder
impl which decides on the size of the vectors, thereby making it clearer that
this is an expert setting and putting it in the hands of the app to decide
how to handle those large vectors.

Will bigger vectors require an algorithmic change (I understand that they
might benefit from one, I'm asking whether one is required beyond performance
gains)? If not, then why do you object to making it an "app problem"? If
2048-dim vectors require a 2GB IW RAM buffer, what's wrong with documenting
it and letting the app choose whether they want/can do it or not?

Do we limit the size of stored fields, or binary doc values? Do we prevent
anyone from using a Codec which loads these big byte[] into memory?

> Also, please let's only discuss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.

The problem being discussed *is* related to search: not lexical search, but
rather semantic search, or search in vector space, whatever you want to call
it. Are you saying that Lucene should only focus on lexical search
scenarios? Or are you saying that semantic search scenarios don't need to
index >1024-dimension vectors in order to produce high-quality results?

I personally don't understand why we wouldn't let apps index bigger vectors,
if all it takes is using bigger RAM buffers. Improvements will come later,
especially as more and more applications try it. If we prevent it, then we
might never see these improvements because no one will even attempt to do it
with Lucene, and IMO that's not a direction we want to head in. While ChatGPT
itself might be hype, I don't think that big vectors are, and if the only
technical reason we have for not supporting them is a bigger RAM buffer, then
I think we should allow it.

On Sun, Apr 9, 2023 at 1:59 PM Robert Muir  wrote:

> Also, please let's only discuss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.
>
> i think we should table this and revisit it after chatgpt hype has
> dissipated.
>
> this hype is causing ppl to behave irrationally, it is why i can't
> converse with basically anyone on this thread because they are all
> stating crazy things that don't make sense.
>
> On Sun, Apr 9, 2023 at 6:25 AM Robert Muir  wrote:
> >
> > Yes, its very clear that folks on this thread are ignoring reason
> > entirely and completely swooned by chatgpt-hype.
> > And what happens when they make chatgpt-8 that uses even more dimensions?
> > backwards compatibility decisions can't be made by garbage hype such
> > as cryptocurrency or chatgpt.
> > Trying to convince me we should bump it because of chatgpt, well, i
> > think it has the opposite effect.
> >
> > Please, lemme see real technical arguments why this limit needs to be
> > bumped. not including trash like chatgpt.
> >
> > On Sat, Apr 8, 2023 at 7:50 PM Marcus Eagan 
> wrote:
> > >
> > > Given the massive amounts of funding going into the development and
> investigation of the project, I think it would be good to at least have
> Lucene be a part of the conversation. Simply because academics typically
> focus on vectors <= 784 dimensions does not mean all users will. A large
> swathe of very important users of the Lucene project never exceed 500k
> documents, though they are shifting to other search engines to try out very
> popular embeddings.
> > >
> > > I think giving our users the opportunity to build chat bots or LLM
> memory machines using Lucene is a positive development, even if some
> datasets won't be able to work well. We don't limit the number of fields
> someone can add in most cases, though we did just undeprecate that API to
> better support multi-tenancy. But people still add so many fields and can
> crash their clusters with mapping explosions when unlimited. The limit to
> vectors feels similar.  I expect more people to dig into Lucene due to its
> openness and robustness as they run into problems. Today, they are forced
> to consider other engines that are more permissive.
> > >
> > > Not every important or valuable Lucene workload is in the millions
> of documents. Many of them only have lots of queries or computationally
> expensive access patterns for B-trees.  We can document that it is very
> ill-advised to make a deployment with vectors too large. What others will
> do with it is on them.
> > >
> > >
> > > On Sat, Apr 8, 2023 at 2:29 PM Adrien Grand  wrote:
> > >>
> > >> As Dawid pointed out earlier on this thread, this is the rule for
> > >> Apache projects: a single -1 vote on a code change is a veto and
> > >> cannot be overridden. Furthermore, Robert is one of the people on this
> > >> project who worked the most on debugging subtle bugs, making Lucene
> > >> more robust and improving our test ...

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-04 Thread Shai Erera
I am not familiar with the internal implementation details, but is it
possible to refactor the code such that someone can provide an extension of
some VectorEncoder/Decoder and control the limits on their side, rather than
Lucene committing to some arbitrary limit (which these days seems to keep
growing)?

If raising the limit only means changing some hard-coded constant, then I
assume such an abstraction can work. We can mark this extension as
@lucene.expert.
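
To illustrate, here is a sketch of what such an extension point could look
like. It is purely hypothetical, nothing like it exists in Lucene today:

/**
 * Hypothetical expert extension point: instead of a global hard-coded
 * constant, the vector format would consult this policy when validating
 * vector fields, so the app owns the limit.
 */
interface VectorDimensionPolicy {
  /** Maximum number of dimensions this application allows for the field. */
  int maxDimensions(String fieldName);
}

/** A permissive policy an expert application might plug in at its own risk. */
final class UnboundedDimensionPolicy implements VectorDimensionPolicy {
  @Override
  public int maxDimensions(String fieldName) {
    return Integer.MAX_VALUE; // no limit; the app accepts the RAM cost
  }
}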

Shai


On Tue, Apr 4, 2023 at 4:33 PM Michael McCandless 
wrote:

> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modify it anytime someone requires a higher limit.
>
> The problem with this approach is it is a one-way door, once released.  We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N; you push as many docs as you like, and if they are
> too many for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti 
> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone but
>> I suspect we'll need some solid foundation for deciding the amount (and it
>> should be high enough to avoid continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, 
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example we
>>> currently need to keep all the vectors in RAM while indexing and we want to
>>> be able to support reasonable numbers of vectors in an index segment. Also
>>> we don't know what innovations might come down the road. Maybe someday we
>>> want to do product quantization and enforce that (k, m) both fit in a byte
>>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>>> 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
 I am also curious what would be the worst-case scenario if we removed
 the constant altogether (so the limit automatically becomes Java's
 Integer.MAX_VALUE).
 i.e.
 right now if you exceed the limit you get:

> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>   throw new IllegalArgumentException(
>       "cannot index vectors with dimension greater than "
>           + ByteVectorValues.MAX_DIMENSIONS);
> }


 in relation to:

> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.


 I agree 100% especially for typing stuff properly and avoiding resource
 waste here and there, but I am not entirely sure this is the case for the
 current implementation i.e. do we have optimizations in place that assume
 the max dimension to be 1024?
 If I missed that (and I likely have), I of course suggest the
 contribution should not just blindly remove the limit, but do it
 appropriately.
 I am not in favor of just doubling it as suggested by some people, I
 would ideally prefer a solution that remains there to a decent extent,
 rather than having to modify it anytime someone requires a higher limit.

 Cheers

 --
 *Alessandro Benedetti*
 Director @ Sease Ltd.
 *Apache Lucene/Solr Committer*
 *Apache Solr PMC Member*

 e-mail: a.benede...@sease.io


 *Sease* - Information Retrieval Applied
 Consulting | Training | Open Source

 Website: Sease.io 
 LinkedIn  | Twitter
  | Youtube
  | Github
 


 On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
 michael.wech...@wyona.com> wrote:

> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do provide also higher dimensions with sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> On 31.03.23 at 14:45, Adrien Grand wrote:
> > I'm ...

Re: Finding out which fields matched the query

2022-06-29 Thread Shai Erera
I think it's a matter of tradeoffs. For example, faceting requires complete
evaluation, and since this field-matching is a kind of aggregation, I think
it's OK if it works the same way. Users can choose which technique they want
to apply based on their use case.

Anyway, I don't think we must introduce this kind of collector in Lucene;
it's definitely something someone can write in their own project.
Shai

On Tue, Jun 28, 2022 at 4:09 PM Alan Woodward  wrote:

> I think it depends on what information we actually want to get here.  If
> it’s just finding which fields matched in which document, then running
> Matches over the top-k results is fine.  If you want to get some kind of
> aggregate data, as in you want to get a list of fields that matched in
> *any* document (or conversely, a list of fields that *didn’t* match -
> useful if you want to prune your schema, for example), then Matches will be
> too slow.  But at the same time, queries are designed to tell you which
> *documents* match efficiently, and they are allowed to advance their
> sub-queries lazily or indeed not at all if the result isn’t needed for
> scoring.  So we don’t really have any way of finding this kind of
> information via a collector that is accurate and performs reasonably.
>
> It *might* be possible to rework Matches so that they act more like an
> iterator and maintain their state within a segment, but there hasn’t been a
> pressing need for that so far.
>
> On 27 Jun 2022, at 12:46, Shai Erera  wrote:
>
> Thanks Alan, yeah I guess I was thinking about the use case I described,
> which involves (usually) simple term queries, but you're definitely right
> about complex boolean clauses as well as non-term queries.
>
> I think the case for highlighter is different though? I mean you usually
> generate highlights only for the top-K results and therefore are probably
> less affected by whether the matches() API is slower than a Collector. And
> if you invoke the API for every document in the index, it might be much
> slower (depending on the index size) than the Collector.
>
> Maybe a hybrid approach which runs the query and caches the docs in a
> DocIdSet (like FacetsCollector does) and then invokes the matches() API
> only on those hits, will let you enjoy the best of both worlds? Assuming
> though that the number of matching documents is not huge.
>
> So it seems there are several options and one should choose based on their
> use case. Do you see an advantage for Lucene to offer a Collector for this
> use case? Or should we tell users to use the matches API?
>
> Shai
>
> On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss  wrote:
>
>> A side note - I've been using a highlighter based on matches API for
>> quite some time now and it's been fantastic. Very precise and handles
>> non-trivial queries (interval queries) very well.
>>
>>
>> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward 
>> wrote:
>> >
>> > Your approach is almost certainly more efficient, but it might give you
>> false matches in some cases - for example, if you have a complex query with
>> many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is
>> positioned on the correct document, but which is part of a clause that
>> doesn’t actually match.  It also only works for term queries, so it won’t
>> match phrases or span/interval groups.  And Matches will work on points or
>> docvalues queries as well.  The reason I added Matches in the first place
>> was precisely to handle these weird corner cases - I had written
>> highlighters which more or less did the same thing you describe with a
>> Collector and the Scorable tree, and I would occasionally get bad
>> highlights back.
>> >
>> > On 27 Jun 2022, at 10:51, Shai Erera  wrote:
>> >
>> > Out of curiosity and for education purposes, is the Collector approach
>> I proposed wrong/inefficient? Or less efficient than the matches() API?
>> >
>> > I'm thinking, if you want to both match/rank documents and as a side
>> effect know which fields matched, the Collector will perform better than
>> Weight.matches(), but I could be wrong.
>> >
>> > Shai
>> >
>> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss 
>> wrote:
>> >>
>> >> The matches API is awesome. Use it. You can also get a rough glimpse
>> >> into a superset of fields potentially matching the query via:
>> >>
>> >> query.visit(
>> >> new QueryVisitor() {
>> >>

Re: Store arrays in DocValues and keep the original order

2022-06-28 Thread Shai Erera
Depending on what you use the field for, you can use BinaryDocValuesField
which encodes a byte[] and lets you store the data however you want. But
how are you using these fields later at search time?

On Tue, Jun 28, 2022 at 3:46 PM linfeng lu  wrote:

> Hi~
>
> We are trying to build an OLAP database based on lucene, and we heavily
> use lucene's *DocValues* (as our column store).
>
> *We try to use DocValues to store array-type fields.* For example, if
> we want to store *field1* and *field2* of this JSON document into
> *DocValues* respectively, SORTED_NUMERIC and SORTED_SET seem to be our
> only options.
>
> {
>   "field1": [ 3, 1, 1, 2 ],
>   "field2": [ "c", "a", "a", "b" ]
> }
>
>
> When we store *field1* in SORTED_NUMERIC and *field2* in SORTED_SET, we
> will get this result:
>
> field1:
>
> - origin: [3, 1, 1, 2]
> - in SORTED_NUMERIC: [1, 1, 2, 3]
>
> field2:
>
> - origin: ["c", "a", "a", "b"]
> - in SORTED_SET: ords [0, 1, 2], terms ["a", "b", "c"]
>
>
> The original ordering relationship of the elements in the array is lost.
>
> We're guessing that lucene's DocValues are designed primarily for sorting
> and aggregation, so the original order of elements may not matter.
>
> But in our use case, it is important to keep the original order of the
> elements in the array (we allow users to access the elements in the array
> using the subscript operator).
>
> We wonder if lucene has plans to add new types of DocValues that can store
> arrays and keep the original order of elements in the array?
>
> Thanks!
>


Re: Finding out which fields matched the query

2022-06-27 Thread Shai Erera
Thanks Alan, yeah I guess I was thinking about the use case I described,
which involves (usually) simple term queries, but you're definitely right
about complex boolean clauses as well as non-term queries.

I think the case for highlighter is different though? I mean you usually
generate highlights only for the top-K results and therefore are probably
less affected by whether the matches() API is slower than a Collector. And
if you invoke the API for every document in the index, it might be much
slower (depending on the index size) than the Collector.

Maybe a hybrid approach which runs the query and caches the docs in a
DocIdSet (like FacetsCollector does) and then invokes the matches() API
only on those hits, will let you enjoy the best of both worlds? Assuming
though that the number of matching documents is not huge.
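
A minimal sketch of that hybrid idea, collapsing the two passes for brevity
(a real implementation would cache the first pass in a per-leaf DocIdSet, as
FacetsCollector does, and should also honor live docs):

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Matches;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

class MatchingFieldsReporter {
  /** Finds the matching docs first, then asks matches() only for those. */
  static Map<Integer, Set<String>> matchingFields(IndexSearcher searcher, Query query)
      throws IOException {
    Map<Integer, Set<String>> fieldsPerDoc = new HashMap<>();
    Weight weight =
        searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1f);
    for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
      Scorer scorer = weight.scorer(leaf);
      if (scorer == null) {
        continue; // no matches in this segment
      }
      DocIdSetIterator it = scorer.iterator();
      for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
        Set<String> fields = new HashSet<>();
        Matches matches = weight.matches(leaf, doc); // iterates matching field names
        if (matches != null) {
          for (String field : matches) {
            fields.add(field);
          }
        }
        fieldsPerDoc.put(leaf.docBase + doc, fields);
      }
    }
    return fieldsPerDoc;
  }
}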

So it seems there are several options and one should choose based on their
use case. Do you see an advantage for Lucene to offer a Collector for this
use case? Or should we tell users to use the matches API?

Shai

On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss  wrote:

> A side note - I've been using a highlighter based on matches API for
> quite some time now and it's been fantastic. Very precise and handles
> non-trivial queries (interval queries) very well.
>
>
> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
>
> Dawid
>
> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward 
> wrote:
> >
> > Your approach is almost certainly more efficient, but it might give you
> false matches in some cases - for example, if you have a complex query with
> many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is
> positioned on the correct document, but which is part of a clause that
> doesn’t actually match.  It also only works for term queries, so it won’t
> match phrases or span/interval groups.  And Matches will work on points or
> docvalues queries as well.  The reason I added Matches in the first place
> was precisely to handle these weird corner cases - I had written
> highlighters which more or less did the same thing you describe with a
> Collector and the Scorable tree, and I would occasionally get bad
> highlights back.
> >
> > On 27 Jun 2022, at 10:51, Shai Erera  wrote:
> >
> > Out of curiosity and for education purposes, is the Collector approach I
> proposed wrong/inefficient? Or less efficient than the matches() API?
> >
> > I'm thinking, if you want to both match/rank documents and as a side
> effect know which fields matched, the Collector will perform better than
> Weight.matches(), but I could be wrong.
> >
> > Shai
> >
> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss 
> wrote:
> >>
> >> The matches API is awesome. Use it. You can also get a rough glimpse
> >> into a superset of fields potentially matching the query via:
> >>
> >> query.visit(
> >> new QueryVisitor() {
> >>   @Override
> >>   public boolean acceptField(String field) {
> >> affectedFields.add(field);
> >> return false;
> >>   }
> >> });
> >>
> >>
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
> >>
> >> I'd go with the Matches API though.
> >>
> >> Dawid
> >>
> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward 
> wrote:
> >> >
> >> > The Matches API will give you this information - it’s still likely to
> be fairly slow, but it’s a lot easier to use than trying to parse Explain
> output.
> >> >
> >> > Query q = ….;
> >> > Weight w = searcher.createWeight(searcher.rewrite(query),
> ScoreMode.COMPLETE_NO_SCORES, 1.0f);
> >> >
> >> > Matches m = w.matches(context, doc);
> >> > List<String> matchingFields = new ArrayList<>();
> >> > for (String field : m) {
> >> >  matchingFields.add(field);
> >> > }
> >> >
> >> > Bear in mind that `matches` doesn’t maintain any state between calls,
> so calling it for every matching document is likely to be slow; for those
> cases Shai’s suggestion of using a Collector and examining low-level
> scorers will perform better, but it won’t work for every query type.
> >> >
> >> >
> >> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
> >> > >
> >> > > Hello!
> >> > >
> >> > > I’m a MSCS student from BU and learning to use Lucene. Recently I
> try to output matched fields by one query. For

Re: Finding out which fields matched the query

2022-06-27 Thread Shai Erera
Thanks Uwe, I didn't know about named queries, but it seems useful. Is
there interest in getting similar functionality in Lucene, or perhaps just
the FieldMatching collector? I'd be happy to PR it.

As for usecase, I was thinking of using something similar to this collector
for some kind of (simple) entity recognition task. If you have a corpus of
documents with many fields which denote product attributes, you could match
a word like "Red" to the various product attribute fields and determine
based on the matching fields + their doc count whether this word likely
represents a Color or Brand entity (hint: it matches both, the question is
which is more probable).

I'm sure there are other ways to achieve this, and probably much smarter
NER implementations, but this one is at least based on the actual data that
you index which guarantees something about the results you will receive if
applying a certain attribute filtering.
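
A sketch of what that lookup could be, assuming hypothetical attribute field
names and the FieldMatchingCollector sketched elsewhere in this thread:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.IndexSearcher;

class AttributeGuesser {
  /**
   * Matches a word like "red" against candidate attribute fields and returns
   * the per-field match counts, e.g. {color=5417, brand=23}, from which the
   * caller can pick the most probable attribute. Field names are
   * hypothetical.
   */
  static Map<String, Integer> guess(IndexSearcher searcher, String word)
      throws IOException, ParseException {
    String[] attributeFields = {"color", "brand", "material"};
    FieldMatchingCollector collector = new FieldMatchingCollector();
    searcher.search(
        new MultiFieldQueryParser(attributeFields, new StandardAnalyzer()).parse(word),
        collector);
    return collector.matchingFieldCounts;
  }
}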

Shai

On Mon, Jun 27, 2022 at 1:01 PM Uwe Schindler  wrote:

> I think the collector approach is perfectly fine for mass-processing of
> queries.
>
> By the way: Elasticserach/Opensearch have a feature already built-in and
> it is working based on collector API in a similar way like you mentioned
> (as far as I remember). It is a bit different as you can tag any clause in
> a BQ (so every query) using a "name" (they call it "named query",
> https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
> When you get the search results, for each hit it tells you which named
> queries were a match on the hit. The actual implementation is some wrapper
> query on each of those clauses that contains the name. In hit collection it
> just collects all named query instances found in the query tree. I think in
> their implementation the wrapper query's scorer impl somehow adds the name
> to some global state.
>
> Uwe
> Am 27.06.2022 um 11:51 schrieb Shai Erera:
>
> Out of curiosity and for education purposes, is the Collector approach I
> proposed wrong/inefficient? Or less efficient than the matches() API?
>
> I'm thinking, if you want to both match/rank documents and as a side
> effect know which fields matched, the Collector will perform better than
> Weight.matches(), but I could be wrong.
>
> Shai
>
> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss 
> wrote:
>
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>>
>> query.visit(
>> new QueryVisitor() {
>>   @Override
>>   public boolean acceptField(String field) {
>> affectedFields.add(field);
>> return false;
>>   }
>> });
>>
>>
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>>
>> I'd go with the Matches API though.
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward 
>> wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to
>> be fairly slow, but it’s a lot easier to use than trying to parse Explain
>> output.
>> >
>> > Query q = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(query),
>> ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);
> >> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {
>> >  matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls,
>> so calling it for every matching document is likely to be slow; for those
>> cases Shai’s suggestion of using a Collector and examining low-level
>> scorers will perform better, but it won’t work for every query type.
>> >
>> >
>> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
>> > >
>> > > Hello!
>> > >
>> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try
>> to output matched fields by one query. For example, for one document, there
>> are 10 fields and 2 of them match the query. I want to get the name of
>> these fields.
>> > >
>> > > I have tried using explain() method and getting description then
>> regex. However it cost so much time.
>> > >
>> > > I wonder what is the efficient way to get the matched fields. Would
>> you please offer some help? Thank you so much!
>> > >
>> > > Best regards,
>> > > Yichen Sun
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>


Re: Finding out which fields matched the query

2022-06-27 Thread Shai Erera
Out of curiosity and for education purposes, is the Collector approach I
proposed wrong/inefficient? Or less efficient than the matches() API?

I'm thinking, if you want to both match/rank documents and as a side effect
know which fields matched, the Collector will perform better than
Weight.matches(), but I could be wrong.

Shai

On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss  wrote:

> The matches API is awesome. Use it. You can also get a rough glimpse
> into a superset of fields potentially matching the query via:
>
> query.visit(
> new QueryVisitor() {
>   @Override
>   public boolean acceptField(String field) {
> affectedFields.add(field);
> return false;
>   }
> });
>
>
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>
> I'd go with the Matches API though.
>
> Dawid
>
> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward 
> wrote:
> >
> > The Matches API will give you this information - it’s still likely to be
> fairly slow, but it’s a lot easier to use than trying to parse Explain
> output.
> >
> > Query q = ….;
> > Weight w = searcher.createWeight(searcher.rewrite(query),
> ScoreMode.COMPLETE_NO_SCORES, 1.0f);
> >
> > Matches m = w.matches(context, doc);
> > List<String> matchingFields = new ArrayList<>();
> > for (String field : m) {
> >  matchingFields.add(field);
> > }
> >
> > Bear in mind that `matches` doesn’t maintain any state between calls, so
> calling it for every matching document is likely to be slow; for those
> cases Shai’s suggestion of using a Collector and examining low-level
> scorers will perform better, but it won’t work for every query type.
> >
> >
> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
> > >
> > > Hello!
> > >
> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try
> to output matched fields by one query. For example, for one document, there
> are 10 fields and 2 of them match the query. I want to get the name of
> these fields.
> > >
> > > I have tried using explain() method and getting description then
> regex. However it cost so much time.
> > >
> > > I wonder what is the efficient way to get the matched fields. Would
> you please offer some help? Thank you so much!
> > >
> > > Best regards,
> > > Yichen Sun
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Finding out which fields matched the query

2022-06-26 Thread Shai Erera
Hi Yichen,

I think you can implement a custom Collector which tracks the fields that
were matched for each Scorer. I implemented an example such Collector below:

public class FieldMatchingCollector implements Collector {

  /** Holds the number of matching documents for each field. */
  public final Map<String, Integer> matchingFieldCounts = new HashMap<>();

  /** Holds which fields were matched for each document. */
  public final Map<Integer, Set<String>> docMatchingFields = new HashMap<>();

  private final Set<Scorer> termScorers = new HashSet<>();

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) {
    final int docBase = context.docBase;
    return new LeafCollector() {

      @Override
      public void setScorer(Scorable scorer) throws IOException {
        termScorers.clear();
        getSubTermScorers(scorer, termScorers);
      }

      @Override
      public void collect(int doc) {
        int basedDoc = doc + docBase;
        for (Scorer scorer : termScorers) {
          if (doc == scorer.docID()) {
            // We know that we're dealing w/ TermScorers
            String matchingField =
                ((TermQuery) scorer.getWeight().getQuery()).getTerm().field();
            docMatchingFields
                .computeIfAbsent(basedDoc, d -> new HashSet<>())
                .add(matchingField);
            matchingFieldCounts.merge(matchingField, 1, Integer::sum);
          }
        }
      }
    };
  }

  private void getSubTermScorers(Scorable scorer, Set<Scorer> set) throws IOException {
    if (scorer instanceof TermScorer) {
      set.add((Scorer) scorer);
    } else {
      for (Scorable.ChildScorable child : scorer.getChildren()) {
        getSubTermScorers(child.child, set);
      }
    }
  }
}

This is of course an example implementation and you can optimize it to match
your needs (e.g. if you're only interested in the set of matching fields, you
can change "matchingFieldCounts" to a Set). Note that "docMatchingFields" is
expensive; I've only included it as an example (and for debugging purposes),
and I recommend omitting it in a real application.

To use it you can do something like:

// Need to use this searcher to guarantee the bulk scorer API isn't used.
IndexSearcher searcher = new ScorerIndexSearcher(reader);

// Parse the query to match against a list of searchable fields
QueryParser qp =
    new MultiFieldQueryParser(FIELDS_TO_SEARCH_ON, new StandardAnalyzer());
Query query = qp.parse(queryText);

// Collect the matching fields
FieldMatchingCollector fieldMatchingCollector = new FieldMatchingCollector();
// If needed, collect the top matching documents too
TopScoreDocCollector topScoreDocCollector =
    TopScoreDocCollector.create(10, Integer.MAX_VALUE);
searcher.search(
    query, MultiCollector.wrap(topScoreDocCollector, fieldMatchingCollector));

System.out.println("matchingFieldCounts = " + fieldMatchingCollector.matchingFieldCounts);
System.out.println("docMatchingFields = " + fieldMatchingCollector.docMatchingFields);
System.out.println("totalHits = " + topScoreDocCollector.getTotalHits());

Hope this helps!

Shai

On Sat, Jun 25, 2022 at 7:58 AM Yichen Sun  wrote:

> Hello!
>
> I’m a MSCS student from BU and learning to use Lucene. Recently I try to
> output matched fields by one query. For example, for one document, there
> are 10 fields and 2 of them match the query. I want to get the name of
> these fields.
>
> I have tried using explain() method and getting description then regex.
> However it cost so much time.
>
> I wonder what is the efficient way to get the matched fields. Would you
> please offer some help? Thank you so much!
>
> Best regards,
> Yichen Sun
>


Re: Plan for GitHub issue metadata management

2022-06-20 Thread Shai Erera
Can we support "Affects Versions" with a label too? "affectsVersion: 8.x"?

Regarding Fix Versions, don't we have multiple of these sometimes? E.g. a
bug fix may go into "8.1", "9.x" and "main"? Is it OK if we just drop
support for this?

On Mon, Jun 20, 2022 at 12:33 PM Tomoko Uchida 
wrote:

> Hello all.
>
> Besides whether the migration of existing issues should be done or not
> (we still have not reached an agreement on it), I started to play around
> with GitHub issue metadata with a test repository.
>
> The current migration plan in my mind:
>
> * Issue Type -> Supported with labels (e.g. "type:bug"); it also can
> be attached when opening issues with issue templates.
> * Issue Priority -> Not supported.
> * Affects Versions -> Not supported.
> * Components -> Supported with labels (e.g.: "module:core").
> * Resolution -> Not supported.
> * Fix Version(s) -> Partially supported with Milestone; an issue can
> have only one milestone - I'm fine with it.
>
> As you may see I'm going to drop most of the metadata that is
> supported in Jira for the sake of brevity. If you have objections or
> other perspectives, could you please speak up.
>
> Tomoko
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Welcome Greg Miller to the Lucene PMC

2022-06-07 Thread Shai Erera
Welcome Greg!

On Tue, Jun 7, 2022 at 12:17 PM Bruno Roustant 
wrote:

> Welcome Greg!
>
> Le mar. 7 juin 2022 à 08:37, Adrien Grand  a écrit :
>
>> I'm pleased to announce that Greg Miller has accepted an invitation to
>> join the Lucene PMC!
>>
>> Congratulations Greg, and welcome aboard!
>>
>> --
>> Adrien
>>
>


Re: [VOTE] Solr to become a top-level Apache project (TLP)

2020-05-12 Thread Shai Erera
I agree this is a procedural vote. Here's my +1 for the proposal.

Shai

On Tue, May 12, 2020, 23:07 Simon Willnauer 
wrote:

> I agree this is not a code change category vote. It’s a majority vote. -1s
> are not vetoes.
>
> Simon
>
> On 12. May 2020, at 21:17, Atri Sharma  wrote:
>
> 
> I would argue against that — this is more of a project level decision with
> no changes to the core code base per se — more of restructuring of it. Sort
> of how a sub project becomes a TLP.
>
> On Wed, 13 May 2020 at 00:38, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
> >> This is in the code modification category, since code will be modified as
> >> a result of this proposal.
>>
>> On Wed, 13 May, 2020, 12:27 am Shawn Heisey,  wrote:
>>
>>> On 5/12/2020 1:36 AM, Dawid Weiss wrote:
>>> > According to an earlier [DISCUSS] thread on the dev list [2], I am
>>> > calling for a vote on the proposal to make Solr a top-level Apache
>>> > project (TLP) and separate Lucene and Solr development into two
>>> > independent entities.
>>>
>>> +1 (pmc)
>>>
>>> We should clarify exactly what kind of vote this is.  If it is in the
>>> "code modification" category, then a single -1 vote would be enough to
>>> defeat the proposal.  There are already some -1 votes.
>>>
>>> Thanks,
>>> Shawn
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>> --
> Regards,
>
> Atri
> Apache Concerted
>
>


Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-04 Thread Shai Erera
Interesting data Michael. I am not sure though that the shared commits tell
us that there are people that contribute to both projects. Eventually, an
API change/update in Lucene will require a change in Solr (but not vice
versa). Those commits will still occur in both projects, only on the Solr
side they will occur when Solr will upgrade to the respective Lucene
version.

I wonder if we can tell, out of the shared commits, how many started in
Lucene and ended in Solr because of the shared build (i.e. an API change
required Solr code changes for the build to pass), vs how many started in
Solr, and ended in Lucene because a core change was needed to support the
Solr feature/update. The first case does not indicate, IMO, a shared
contribution (whoever changes a Lucene API will not then go and update Solr
and Elasticsearch if the projects were split), while the second case is a
stronger indication of a shared contribution.

Maybe if we could "label" committers as mostly Lucene/Solr, we could tell
more about the shared commits?

Anyway, data is good, I agree.

Shai

On Mon, May 4, 2020 at 5:49 PM Michael Sokolov  wrote:

> I always like to look at data when making a big decision, so I
> gathered some statistics about authors and commits to git over the
> history of the project. I wanted to see what these statistics could
> tell us about the degree of overlap between the two projects and
> whether it has changed over time. Using commands like
>
>  git log --pretty=%an --since=2012 -- lucene
>  git log --pretty=%an --since=2012 -- solr
>
> I looked at the authors of commits in the lucene and solr top-level
> folders of the project. I think this makes a reasonable proxy for
> contributors to the two projects. From there I found that since 2012,
> there are 60 Lucene-only authors, 71 Solr-only authors, and 101
> authors (or 43%) contributing at least one commit to each project.
> Since 2018, the percentage of both-project authors is somewhat lower:
> 36%.
>
> I also looked at commits spanning both projects. I'm not sure this
> captures all the work that touches both projects, but it's a window
> into that, at least. I found that since 2012, 1387/19063 (6.8%) of
> commits spanned both project folders. Since 2018, 7.4% did.
>
> I don't think you can really draw very many meaningful conclusions
> from this, but a few things jump out: First, it is clear that these
> projects are not completely separate today. A substantial number of
> people commit to both, over time, although most people do not. Also,
> relatively few commits span both projects. Some do though, and it's
> certainly worth considering what the workflow for such changes would
> be like in the split world. Maybe a majority of these are
> build-related; it's hard to tell from this coarse analysis.
>
>
> On Mon, May 4, 2020 at 5:11 AM Dawid Weiss  wrote:
> >
> > Dear Lucene and Solr developers!
> >
> > A few days ago, I initiated a discussion among PMC members about
> > potential pros and cons of splitting the project into separate Lucene
> > and Solr entities by promoting Solr to its own top-level Apache
> > project (TLP). Let me share with you the motivation for such an action
> > and some follow-up thoughts I heard from other PMC members so far.
> >
> > Please read this e-mail carefully. Both the PMC and I look forward to
> > hearing your opinion. This is a DISCUSS thread and it will be followed
> > next week by a VOTE thread. This is our shared project and we should
> > all shape its future responsibly.
> >
> > The big question is this: “Is this the right time to split Solr and
> > Lucene into two independent projects?”.
> >
> > Here are several technical considerations that drove me to ask the
> > question above (in no order of priorities):
> >
> > 1) Precommit/ test times. These are crazy high. If we split into two
> > projects we can pretty much cut all of Lucene testing out of Solr (and
> > likewise), making development a bit more fun again.
> >
> > 2) Build system itself and source release packaging. The current
> > combined codebase is a *beast* to maintain. Working with gradle on
> > both projects at once made me realise how little the two have in
> > common. The code layout, the dependencies, even the workflow of people
> >
> > working on these projects... The build (both ant and gradle) is full
> > of Solr and Lucene-specific exceptions and hooks that could be more
> > elegantly solved if moved to each project independently.
> >
> > 3) Packaging. There is no single source distribution package for
> > Solr+Lucene. They are already "independent" there. Why should Lucene
> > and Solr always be released at the same pace? Does it always make
> > sense?
> >
> > 4) Solr is essentially taking in Lucene and its dependencies as a
> > whole (so is Elasticsearch and many other projects). In my opinion
> > this makes Lucene eligible for refactoring and
> >
> > maintenance as a separate component. The learning curve for people
> > coming to each ...

Re: Welcome Jason Gerlowski to the PMC

2019-02-22 Thread Shai Erera
Congratulations Jason!

On Fri, Feb 22, 2019, 23:20 Anshum Gupta  wrote:

> Congratulations and welcome Jason!
>
> *  *Anshum
>
>
> On Feb 22, 2019, at 7:21 AM, Jan Høydahl  wrote:
>
> I am pleased to announce that Jason Gerlowski has accepted the PMC's
> invitation to join.
>
> Welcome Jason!
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> 
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
>
>
>


Re: Welcome Nick Knize to the PMC

2019-01-09 Thread Shai Erera
Welcome!

On Wed, Jan 9, 2019 at 9:30 PM Christine Poerschke (BLOOMBERG/ LONDON) <
cpoersc...@bloomberg.net> wrote:

> Welcome Nick!
>
> From: dev@lucene.apache.org At: 01/09/19 15:12:38
> To: dev@lucene.apache.org
> Subject: Welcome Nick Knize to the PMC
>
> I am pleased to announce that Nick Knize has accepted the PMC's
> invitation to join.
>
> Welcome Nick!
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
>


[jira] [Resolved] (LUCENE-8588) Replace usage of deprecated RAMOutputStream

2018-12-04 Thread Shai Erera (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-8588.

    Resolution: Fixed
 Fix Version/s: master (8.0), 7.7

> Replace usage of deprecated RAMOutputStream
> ---
>
> Key: LUCENE-8588
> URL: https://issues.apache.org/jira/browse/LUCENE-8588
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Trivial
> Fix For: master (8.0), 7.7
>
> Attachments: LUCENE-8588.patch
>
>
> While reviewing code in {{FrozenBufferedUpdates}} I noticed that it uses the 
> deprecated {{RAMOutputStream}}. This issue fixes it. Separately we should 
> reduce the usage of that class, so that we can really remove it.
>  
> Besides that, while running tests I hit a test failure which at first I 
> thought was related to this change, but then noticed that the test doesn't 
> close the DirectoryReader (I run tests on Windows), so that fix is included 
> in this patch too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8588) Replace usage of deprecated RAMOutputStream

2018-12-04 Thread Shai Erera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708780#comment-16708780
 ] 

Shai Erera commented on LUCENE-8588:


[~dweiss] thanks for pointing that out. I will not commit that change then. I 
pushed a commit that closes the DirReader in the test and one that fixes a 
typo. Thanks!

> Replace usage of deprecated RAMOutputStream
> ---
>
> Key: LUCENE-8588
> URL: https://issues.apache.org/jira/browse/LUCENE-8588
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Trivial
> Attachments: LUCENE-8588.patch
>
>
> While reviewing code in {{FrozenBufferedUpdates}} I noticed that it uses the 
> deprecated {{RAMOutputStream}}. This issue fixes it. Separately we should 
> reduce the usage of that class, so that we can really remove it.
>  
> Besides that, while running tests I hit a test failure which at first I 
> thought was related to this change, but then noticed that the test doesn't 
> close the DirectoryReader (I run tests on Windows), so that fix is included 
> in this patch too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8588) Replace usage of deprecated RAMOutputStream

2018-12-04 Thread Shai Erera (JIRA)
Shai Erera created LUCENE-8588:
--

 Summary: Replace usage of deprecated RAMOutputStream
 Key: LUCENE-8588
 URL: https://issues.apache.org/jira/browse/LUCENE-8588
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Shai Erera
Assignee: Shai Erera


While reviewing code in {{FrozenBufferedUpdates}} I noticed that it uses the 
deprecated {{RAMOutputStream}}. This issue fixes it. Separately we should 
reduce the usage of that class, so that we can really remove it.

 

Besides that, while running tests I hit a test failure which at first I thought 
was related to this change, but then noticed that the test doesn't close the 
DirectoryReader (I run tests on Windows), so that fix is included in this patch 
too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8397) Add DirectoryTaxonomyWriter.getCache

2018-07-13 Thread Shai Erera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542887#comment-16542887
 ] 

Shai Erera commented on LUCENE-8397:


+1

> Add DirectoryTaxonomyWriter.getCache
> 
>
> Key: LUCENE-8397
> URL: https://issues.apache.org/jira/browse/LUCENE-8397
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8397.patch
>
>
> {{DirectoryTaxonomyWriter}} uses a cache to hold recently mapped labels / 
> ordinals.  You can provide an impl when you create the class, or it will use 
> a default impl.
>  
> I'd like to add a getter, {{DirectoryTaxonomyWriter.getCache}} to retrieve 
> the cache it's using; this is helpful for getting diagnostics (how many 
> cached labels, how much RAM used, etc.).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8272) Share internal DV update code between binary and numeric

2018-04-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449891#comment-16449891
 ] 

Shai Erera commented on LUCENE-8272:


I put some comments on the PR, but I don't see them mentioned here, so FYI.

> Share internal DV update code between binary and numeric
> 
>
> Key: LUCENE-8272
> URL: https://issues.apache.org/jira/browse/LUCENE-8272
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-8272.patch
>
>
> Today we duplicate a fair portion of the internal logic to
> apply updates of binary and numeric doc values. This change refactors
> this non-trivial code to share the same code path and only differ in
> if we provide a binary or numeric instance. This also allows us to
> iterator over the updates only once rather than twice once for numeric
> and once for binary fields.
> 
> This change also subclass DocValuesIterator from 
> DocValuesFieldUpdates.Iterator
> which allows easier consumption down the road since it now shares most of 
> it's
> interface with DocIdSetIterator which is the main interface for this in 
> Lucene.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome to the PMC

2018-04-02 Thread Shai Erera
Welcome!

On Tue, Apr 3, 2018, 01:22 Mark Miller  wrote:

> Welcome!
> On Mon, Apr 2, 2018 at 3:49 PM Adrien Grand  wrote:
>
>> I am pleased to announce that Cao Mạnh Đạt has accepted the PMC's
>> invitation to join.
>>
>> Welcome Đạt!
>>
> --
> - Mark
> about.me/markrmiller
>


Re: Welcome Jason Gerlowski as committer

2018-02-08 Thread Shai Erera
Welcome!

On Thu, Feb 8, 2018, 20:56 Joel Bernstein  wrote:

> Welcome Jason!
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Feb 8, 2018 at 12:41 PM, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> Congratulations Jason! :-)
>>
>> On Thu, Feb 8, 2018 at 10:38 PM, Karl Wright  wrote:
>>
>>> Hello Jason!
>>>
>>>
>>> On Thu, Feb 8, 2018 at 12:06 PM, Dawid Weiss 
>>> wrote:
>>>
 Welcome Jason!

 Dawid

 On Thu, Feb 8, 2018 at 6:04 PM, Adrien Grand  wrote:
 > Welcome Jason!
 >
 > Le jeu. 8 févr. 2018 à 18:03, David Smiley 
 a
 > écrit :
 >>
 >> Hello everyone,
 >>
 >> It's my pleasure to announce that Jason Gerlowski is our latest
 committer
 >> for Lucene/Solr in recognition for his contributions to the
 project!  Please
 >> join me in welcoming him.  Jason, it's tradition for you to introduce
 >> yourself with a brief bio.
 >>
 >> Congratulations and Welcome!
 >> --
 >> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
 >> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
 >> http://www.solrenterprisesearchserver.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


>>>
>>
>


Re: Welcome Jim Ferenczi to the PMC

2017-12-20 Thread Shai Erera
Welcome!

On Wed, Dec 20, 2017 at 6:07 PM Erick Erickson 
wrote:

> Welcome!
>
> On Wed, Dec 20, 2017 at 7:23 AM, Joel Bernstein 
> wrote:
> > Welcome Jim!
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Wed, Dec 20, 2017 at 10:12 AM, David Smiley  >
> > wrote:
> >>
> >> Welcome Jim!
> >>
> >> On Wed, Dec 20, 2017 at 9:28 AM Steve Rowe  wrote:
> >>>
> >>> Congrats and welcome Jim!
> >>>
> >>> --
> >>> Steve
> >>> www.lucidworks.com
> >>>
> >>> > On Dec 20, 2017, at 5:18 AM, Adrien Grand  wrote:
> >>> >
> >>> > I am pleased to announce that Jim Ferenczi has accepted the PMC's
> >>> > invitation to join.
> >>> >
> >>> > Welcome Jim!
> >>>
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>
> >> --
> >> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> >> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> >> http://www.solrenterprisesearchserver.com
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Commented] (LUCENE-8060) Require users to tell us whether they need total hit counts

2017-11-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263175#comment-16263175
 ] 

Shai Erera commented on LUCENE-8060:


What if we conceptually remove {{TopDocs.totalHits}} and if users require that, 
they can chain their Collector with {{TotalHitCountCollector}}? We can also add 
that boolean as a sugar to {{IndexSearcher.search()}} API.

If we're OK w/ removing {{TopDocs.totalHits}}, and users getting a compilation 
error (that's easy to fix), then that's an easy option/change. Or... we 
deprecate it, but have the simple IndexSearcher.search() APIs still compute it 
(by chaining this collector), and let users who'd like to optimize use the 
search() API which takes a Collector.

Just a thought...
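
To illustrate the chaining idea, a minimal sketch (assuming an {{IndexSearcher}} named {{searcher}} and a {{Query}} named {{query}}; the exact {{TopScoreDocCollector.create}} signature depends on the Lucene version):

{code}
TopScoreDocCollector topDocs = TopScoreDocCollector.create(10);
TotalHitCountCollector totalHits = new TotalHitCountCollector();
searcher.search(query, MultiCollector.wrap(topDocs, totalHits));
// topDocs.topDocs() holds the top hits, totalHits.getTotalHits() the exact count
{code}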

> Require users to tell us whether they need total hit counts
> ---
>
> Key: LUCENE-8060
> URL: https://issues.apache.org/jira/browse/LUCENE-8060
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (8.0)
>
>
> We are getting optimizations when hit counts are not required (sorted 
> indexes, MAXSCORE, short-circuiting of phrase queries) but our users won't 
> benefit from them unless we disable exact hit counts by default or we require 
> them to tell us whether hit counts are required.
> I think making hit counts approximate by default is going to be a bit trappy, 
> so I'm rather leaning towards requiring users to tell us explicitly whether 
> they need total hit counts. I can think of two ways to do that: either by 
> passing a boolean to the IndexSearcher constructor or by adding a boolean to 
> all methods that produce TopDocs instances. I like the latter better but I'm 
> open to discussion or other ideas?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Noble Paul to the PMC

2017-11-20 Thread Shai Erera
Welcome Noble!

On Mon, Nov 20, 2017 at 3:54 PM Steve Rowe  wrote:

> Congrats and welcome Noble!
>
> --
> Steve
> www.lucidworks.com
>
> > On Nov 19, 2017, at 3:02 PM, Adrien Grand  wrote:
> >
> > I am pleased to announce that Noble Paul has accepted the PMC's
> invitation to join.
> >
> > Welcome Noble!
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Welcome Mike Drob as Lucene/Solr committer

2017-05-08 Thread Shai Erera
Welcome Mike!

On Tue, May 9, 2017, 03:47 Đạt Cao Mạnh  wrote:

> Congrats Mike!
> On Tue, May 9, 2017 at 7:35 AM Dennis Gove  wrote:
>
>> Welcome Mike!
>>
>> On Mon, May 8, 2017 at 11:42 AM, Mark Miller 
>> wrote:
>>
>>> I'm pleased to announce that Mike Drob has accepted the PMC's
>>> invitation to become a committer.
>>>
>>> Mike, it's tradition that you introduce yourself with a brief bio /
>>> origin story, explaining how you arrived here.
>>>
>>> Your existing Apache handle has already been added to the "lucene" LDAP
>>> group, so you now have commit privileges.
>>>
>>> Please celebrate this rite of passage, and confirm that the right
>>> karma has in fact been enabled, by embarking on the challenge of adding
>>> yourself to the committers section of the Who We Are page on the
>>> website: http://lucene.apache.org/whoweare.html (use the ASF CMS
>>> bookmarklet
>>> at the bottom of the page here: https://cms.apache.org/#bookmark -
>>> more info here http://www.apache.org/dev/cms.html).
>>>
>>> Congratulations and welcome!
>>> --
>>> - Mark
>>> about.me/markrmiller
>>>
>>


[jira] [Resolved] (SOLR-10505) Support terms' statistics for multiple fields in TermsComponent

2017-04-20 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved SOLR-10505.
---
   Resolution: Fixed
Fix Version/s: master (7.0)
   6.6

Pushed to master and branch_6x.

> Support terms' statistics for multiple fields in TermsComponent
> ---
>
> Key: SOLR-10505
> URL: https://issues.apache.org/jira/browse/SOLR-10505
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Fix For: 6.6, master (7.0)
>
> Attachments: SOLR-10505.patch
>
>
> Currently if you specify multiple {{terms.fl}} parameters on the request, 
> while requesting terms' statistics, you get them for the first requested 
> field (because the code only uses {{fields[0]}}). There's no reason not 
> to return the stats for the terms in all specified fields. It's a rather 
> simple change, and I will post a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10505) Support terms' statistics for multiple fields in TermsComponent

2017-04-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973296#comment-15973296
 ] 

Shai Erera commented on SOLR-10505:
---

All tests pass, if there are no objections, I'd like to commit this.

> Support terms' statistics for multiple fields in TermsComponent
> ---
>
> Key: SOLR-10505
> URL: https://issues.apache.org/jira/browse/SOLR-10505
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: SOLR-10505.patch
>
>
> Currently if you specify multiple {{terms.fl}} parameters on the request, 
> while requesting terms' statistics, you get them for the first requested 
> field (because the code only uses {{fields[0]}}). There's no reason not 
> to return the stats for the terms in all specified fields. It's a rather 
> simple change, and I will post a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10505) Support terms' statistics for multiple fields in TermsComponent

2017-04-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-10505:
--
Attachment: SOLR-10505.patch

Patch with tests.

> Support terms' statistics for multiple fields in TermsComponent
> ---
>
> Key: SOLR-10505
> URL: https://issues.apache.org/jira/browse/SOLR-10505
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: SOLR-10505.patch
>
>
> Currently if you specify multiple {{terms.fl}} parameters on the request, 
> while requesting terms' statistics, you get them for the first requested 
> field (because the code only uses {{fields[0]}}). There's no reason not 
> to return the stats for the terms in all specified fields. It's a rather 
> simple change, and I will post a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-10505) Support terms' statistics for multiple fields in TermsComponent

2017-04-17 Thread Shai Erera (JIRA)
Shai Erera created SOLR-10505:
-

 Summary: Support terms' statistics for multiple fields in 
TermsComponent
 Key: SOLR-10505
 URL: https://issues.apache.org/jira/browse/SOLR-10505
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Shai Erera
Assignee: Shai Erera


Currently if you specify multiple {{terms.fl}} parameters on the request, while 
requesting terms' statistics, you get them for the first requested field 
(because the code only uses {{fields[0]}}). There's no reason not to return 
the stats for the terms in all specified fields. It's a rather simple change, 
and I will post a patch shortly.
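
For example, after this change a request like the following (field and term names are made up) would return stats for the listed term in both fields, not just the first:

{noformat}
curl "http://localhost:8983/solr/mycollection/terms?terms.list=t1&terms.fl=title&terms.fl=body"
{noformat}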



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-28 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved SOLR-10349.
---
Resolution: Fixed

Pushed to master and branch_6x.

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Fix For: master (7.0), 6.6
>
> Attachments: SOLR-10349.patch, SOLR-10349.patch, SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-28 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-10349:
--
Fix Version/s: 6.6
   master (7.0)

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Fix For: master (7.0), 6.6
>
> Attachments: SOLR-10349.patch, SOLR-10349.patch, SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-25 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941838#comment-15941838
 ] 

Shai Erera commented on SOLR-10349:
---

If there are no objections, I'd like to commit that tomorrow.

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Attachments: SOLR-10349.patch, SOLR-10349.patch, SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-23 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-10349:
--
Attachment: SOLR-10349.patch

That was a good comment [~joel.bernstein]!! I changed more code to adapt to the 
new format where necessary. Running tests now, but if you think/know of other 
places which might be affected by this change, please let me know.

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Attachments: SOLR-10349.patch, SOLR-10349.patch, SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-23 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938277#comment-15938277
 ] 

Shai Erera commented on SOLR-10349:
---

Thanks [~joel.bernstein], the distributed test suggestion helped me find 
{{DistributedTermsComponentTest}}, and of course as soon as I added a test to 
it, the client failed, since it expects a number but got a map. I will see how 
to fix it.

This also answers your second question: this commit changes the response 
structure if you ask for {{terms.ttf}}. I put an example output in the 
description above.

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Attachments: SOLR-10349.patch, SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-23 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-10349:
--
Attachment: SOLR-10349.patch

Added CHANGES entry.

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Attachments: SOLR-10349.patch, SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-23 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-10349:
--
Attachment: SOLR-10349.patch

Patch implements the proposed addition. [~joel.bernstein], not sure if you're 
still interested in reviewing this, but if you are, your comments are appreciated!

> Add totalTermFreq support to TermsComponent
> ---
>
> Key: SOLR-10349
> URL: https://issues.apache.org/jira/browse/SOLR-10349
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Shai Erera
>    Assignee: Shai Erera
>Priority: Minor
> Attachments: SOLR-10349.patch
>
>
> See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
> {{docFreq}} and {{totalTermFreq}} are already available to the 
> TermsComponent, it's just that it doesn't add the ttf measure to the response.
> This issue adds a new {{terms.ttf}} parameter which if set to true results in 
> the following output:
> {noformat}
> 
>   
> 
>   2
>   2
> 
> ...
> {noformat}
> The reason for the new parameter is to not break backward-compatibility, 
> though I wish we could always return those two measures (it doesn't cost us 
> anything, the two are already available to the code). Maybe we can break the 
> response in {{master}} and add this parameter only to {{6x}} as deprecated? I 
> am also fine if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-10349) Add totalTermFreq support to TermsComponent

2017-03-23 Thread Shai Erera (JIRA)
Shai Erera created SOLR-10349:
-

 Summary: Add totalTermFreq support to TermsComponent
 Key: SOLR-10349
 URL: https://issues.apache.org/jira/browse/SOLR-10349
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor


See discussion here: http://markmail.org/message/gmpmege2jpfrsp75. Both 
{{docFreq}} and {{totalTermFreq}} are already available to the TermsComponent, 
it's just that it doesn't add the ttf measure to the response.

This issue adds a new {{terms.ttf}} parameter which if set to true results in 
the following output:

{noformat}

  

  2
  2

...
{noformat}

The reason for the new parameter is to not break backward-compatibility, though 
I wish we could always return those two measures (it doesn't cost us anything, 
the two are already available to the code). Maybe we can break the response in 
{{master}} and add this parameter only to {{6x}} as deprecated? I am also fine 
if we leave it and handle it in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
Hmm .. so if I want to add totalTermFreq to the response, it will break the
current output format of TermsComponent, which returns only the docFreq for
each term. What's our BWC policy for such an API, and is there a way to
handle it?

I can add a new terms.ttf parameter, so that if you set it to true, the
response will look different (each term will have both docFreq and
totalTermFreq elements), but if you don't, you will get the same response as
today. Is that acceptable?
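
For example, the request with the new parameter would look something like this
(a sketch, reusing the 'text' field and term 't1' from my earlier mail):

curl "http://localhost:8983/solr/mycollection/terms?terms.fl=text&terms.list=t1&terms.ttf=true"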

Somewhat related, but can be handled separately, I noticed that if you
specify terms.list and multiple terms.fl parameters, you only receive stats
for the first field (the rest are ignored), but if you don't specify
terms.list, you get results for all fields. I don't see any reason not to
support multiple fields with terms.list. What do you think?

On Wed, Feb 22, 2017 at 10:08 PM Shai Erera <ser...@gmail.com> wrote:

> Looks like this could be a very easy addition to TermsComponent? From what
> I read in the code, it uses TermContext to compute/hold the stats, and the
> latter already has docFreq and totalTermFreq (!!). It's just that
> TermsComponent does not output TTF (only computes it...):
>
> for (int i = 0; i < terms.length; i++) {
>   if (termContexts[i] != null) {
>     String outTerm = fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
>     int docFreq = termContexts[i].docFreq();
>     termsMap.add(outTerm, docFreq);
>   }
> }
>
>
> On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> Yeah, I think expanding the functionality of the terms component looks
> like the right place to add these stats.
>
> I plan on exposing these types of terms stats as Streaming Expression
> functions but I would likely use the terms component under the covers.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera <ser...@gmail.com> wrote:
>
> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why the Terms component doesn't return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
>
> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera <ser...@gmail.com> wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is submit requests as follows:
>
> curl "
> http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>
>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
Looks like this could be a very easy addition to TermsComponent? From what
I read in the code, it uses TermContext to compute/hold the stats, and the
latter already has docFreq and totalTermFreq (!!). It's just that
TermsComponent does not output TTF (only computes it...):

for (int i = 0; i < terms.length; i++) {
  if (termContexts[i] != null) {
    String outTerm = fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
    int docFreq = termContexts[i].docFreq();
    termsMap.add(outTerm, docFreq);
  }
}


On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein <joels...@gmail.com> wrote:

> Yeah, I think expanding the functionality of the terms component looks
> like the right place to add these stats.
>
> I plan on exposing these types of terms stats as Streaming Expression
> functions but I would likely use the terms component under the covers.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera <ser...@gmail.com> wrote:
>
> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why the Terms component doesn't return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
>
> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera <ser...@gmail.com> wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is submit requests as follows:
>
> curl "
> http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>
>


Re: Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
No, they are not global distributed stats. I am willing to live with
approximated stats though (unless again, there's an API which can give me
both). I wonder why the Terms component doesn't return ttf in addition to
docfreq. The API (at the Lucene level) is right there already.

On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein <joels...@gmail.com> wrote:

> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera <ser...@gmail.com> wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> The way I use the API is submit requests as follows:
>
> curl "
> http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>


Getting totalTermFreq and docFreq for terms

2017-02-22 Thread Shai Erera
Hi

I am currently using function queries to obtain these two statistics, as I
didn't see a better or more explicit API and the Terms component only
returns docFreq, but not totalTermFreq.

The way I use the API is submit requests as follows:

curl "
http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"

Today I noticed that it sometimes returns 0 for these stats for existing
terms. After debugging and going through the code, I noticed that it
performs analysis on the value that's given. So if I provide an already
stemmed value, it analyzes the value further and in some cases it results
in a non-existing term (and in other cases I get stats for a term I didn't
ask for).

I want to get the stats of the indexed version of the terms, and that's why
I send the already stemmed one. In my case I tried to get the stats for the
term 'disguis' which is the stem of 'disguise' and 'disguised', however it
further analyzed the value to 'disgui' (per the analysis chain) and that
term does not exist in the index.

So first question is -- is this the right API to retrieve such statistics?
I didn't find another one, but could be I missed it.

If it is, why does it analyze the value? I tried to wrap the value with
single and double quotes, but of course that does not affect the analysis
... is analysis an intended behavior or a bug?

Shai


Re: Welcome Cao Manh Dat as a Lucene/Solr committer

2017-01-09 Thread Shai Erera
Welcome!

On Mon, Jan 9, 2017, 21:37 Michael McCandless 
wrote:

> Welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Jan 9, 2017 at 10:57 AM, Joel Bernstein 
> wrote:
> > I'm pleased to announce that Cao Manh Dat has accepted the Lucene
> > PMC's invitation to become a committer.
> >
> > Dat, it's tradition that you introduce yourself with a brief bio.
> >
> > Your account has been added to the "lucene" LDAP group, so you
> > now have commit privileges. Please test this by adding yourself to the
> > committers section of the Who We Are page on the website:
> >  (instructions here
> > ).
> >
> > The ASF dev page also has lots of useful links:
> > .
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Commented] (LUCENE-7590) Add DocValues statistics helpers

2016-12-20 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766325#comment-15766325
 ] 

Shai Erera commented on LUCENE-7590:


[~shia] where do you see that? I checked master and there's no {{description}} 
in the file at all. Here's the code:

{code}
public LongDocValuesStats(String field) {
  super(field, Long.MAX_VALUE, Long.MIN_VALUE);
}
{code}

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7590-2.patch, LUCENE-7590-sorted-numeric.patch, 
> LUCENE-7590-sorted-set.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-7590) Add DocValues statistics helpers

2016-12-18 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-7590.

   Resolution: Fixed
Fix Version/s: 6.4
   master (7.0)

Committed to master and 6x. This is now complete.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7590-2.patch, LUCENE-7590-sorted-numeric.patch, 
> LUCENE-7590-sorted-set.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-18 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590-sorted-set.patch

Patch adds {{SortedDocValuesStats}} and {{SortedSetDocValuesStats}} for sorted 
and sorted-set DV fields. With this patch, I think the issue is ready to be 
closed. I am not sure that we need a DVStats for a BinaryDVField at this point, 
but if demand arises, it should be easy to add.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590-2.patch, LUCENE-7590-sorted-numeric.patch, 
> LUCENE-7590-sorted-set.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590-sorted-numeric.patch

Patch adds DVStats for {{SortedNumericDocValuesField}}.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590-2.patch, LUCENE-7590-sorted-numeric.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-15 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590-2.patch

Patch adds {{sum}}, {{stdev}} and {{variance}} stats to 
{{NumericDocValuesStats}}. I also added a CHANGES entry, which I forgot to add 
in the previous commit.
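
For reference, one overflow-friendlier way to accumulate these incrementally is Welford's algorithm; a generic sketch ({{values}} stands in for the collected DV values), not necessarily what the patch does:

{code}
double[] values = {1, 2, 3}; // stand-in for the collected values
long count = 0;
double mean = 0, m2 = 0;
for (double v : values) {
  count++;
  double delta = v - mean;
  mean += delta / count;    // running mean
  m2 += delta * (v - mean); // running sum of squared deviations
}
double variance = m2 / count;
double stdev = Math.sqrt(variance);
{code}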

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590-2.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7590) Add DocValues statistics helpers

2016-12-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15748192#comment-15748192
 ] 

Shai Erera commented on LUCENE-7590:


There are now few tasks left:

* Add more statistics, such as {{sum}} and {{stdev}} (for numeric fields). 
Should we care about overflow, or only document it?

* We can also compute more stats like what Solr gives in [Stats 
Component|https://cwiki.apache.org/confluence/display/solr/The+Stats+Component#TheStatsComponent-StatisticsSupported].
 What do you think?

* Add stats for {{SortedDocValues}}. This should be fairly straightforward by 
comparing the {{BytesRef}} of all matching documents. But I don't think we 
should have a {{mean}} stat for it? Likewise for {{SortedSetDocValues}}.

* What should we do with {{SortedNumericDocValues}}? {{min}} and {{max}} are 
well defined, but what about {{mean}}? Should it be across all values?

I intend to close this issue and handle the rest in follow-on issues, unless 
you think otherwise. Also, would appreciate your feedback on the above points.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-14 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

Patch makes {{DocValuesIterator}} package-private again and adds an API to 
{{DocValuesStats}} to help in determining whether a document has a value for 
the field.

The Collector needs to be public because you're supposed to initialize it and 
run a search with it.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

[~jpountz] I accept your proposal about missing: in case a reader does not 
have the requested DV field, the collector returns a {{LeafCollector}} which 
updates {{missing}} for every hit document.

I also renamed the classes as proposed earlier, as well as extracted 
{{DocValuesStats}} and friends into its own class.

I still didn't address changing {{DocValuesIterator}} to public. BTW, I noticed 
that {{SimpleTextDocValuesReader}} defines a private class named 
{{DocValuesIterator}} with exactly the same signature; I assume because the 
other one is package-private. So I feel that changing {{DVI}} to public is 
beneficial beyond the scope of this issue alone. What do you think?

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7590) Add DocValues statistics helpers

2016-12-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745855#comment-15745855
 ] 

Shai Erera commented on LUCENE-7590:


bq. Instead of using a NOOP_COLLECTOR, you could throw a 
CollectionTerminatedException

OK, good idea.

bq. By the way, in such cases I think we should still increase the missing 
count?

I am not sure? I mean, {{missing}} represents all the documents that matched 
the query and did not have a value for that DV field. But when 
{{getLeafCollector}} is called, we don't know yet that any documents will be 
matched by the query at all (I think?) and therefore updating missing might be 
confusing? I.e., I'd expect that if anyone chained {{TotalHitsCollector}} with 
{{DocValuesStatsCollector}}, then {{totalHits = stats.count() + 
stats.missing()}}? I am open to discuss it, just not sure I always want to 
update missing with {{context.reader().numDocs()}} ...

bq. Can we avoid making DocValuesIterator public?

I did not find a way, since it's part of {{DocValuesStats.init()}} API and I 
think users should be able to provide their own {{Stats}} impl, e.g. if they 
want to compute something on a {{BinaryDocValues}} field?

Here too, I'd love to get more ideas though. I tried to avoid implementing N 
collectors, one for each DV type, where they share a large portion of the code. 
But if you have strong opinions about making {{DVI}} public, maybe that's what 
we should do ...

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

Added tests for {{DoubleNumericDocValuesStats}}.

Now that I review the class names, how do you feel about removing {{Numeric}} 
from the concrete classes, so they're called {{Long/DoubleDocValuesStats}}?

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

Patch implements a {{DocValuesStatsCollector}}. Note some key design decisions:

A {{DocValuesStats}} is responsible for providing the specific 
{{DocValuesIterator}} for a {{LeafReaderContext}}. It then accumulates the 
values and computes the statistics: it computes {{missing}} and {{count}} 
itself, leaving {{min}} and {{max}} to the actual implementation. Also, this 
base stats class does not define a {{mean}}, as at least for now I'm not sure 
how the mean value of a {{SortedSetDocValues}} is defined.

An abstract {{NumericDocValuesStats}} implementation for single-numeric DV 
fields, which also adds a {{mean}} statistic, with two concrete 
implementations: {{LongNumericDocValuesStats}} and 
{{DoubleNumericDocValuesStats}}.

This hierarchy should allow us to add further statistics for {{SortedSet}} and 
{{SortedNumeric}} DV fields. I did not implement them yet, as I'm not sure 
about some of the statistics (e.g. should the {{mean}} stat of a 
{{SortedNumeric}} be the mean across all values, or the minimum per document or 
...). Let's discuss that separately.

Also, note that I had to make {{DocValuesIterator}} public in order to declare 
it in this collector.

If you're OK with the design and implementation, I want to separate 
{{DocValuesStats}} into its own file, for clarity. I did not do it yet though.
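
Usage would look roughly like this (a sketch; the field name is made up, the class names are the ones in this patch, and an {{IndexSearcher}} named {{searcher}} is assumed):

{code}
LongNumericDocValuesStats stats = new LongNumericDocValuesStats("price");
searcher.search(new MatchAllDocsQuery(), new DocValuesStatsCollector(stats));
// min/max come from the base stats, mean from the numeric specialization
long min = stats.min(), max = stats.max();
double mean = stats.mean();
{code}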

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch, 
> LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7590) Add DocValues statistics helpers

2016-12-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15743089#comment-15743089
 ] 

Shai Erera commented on LUCENE-7590:


bq. Let's implement the computation of these stats by writing a Collector and 
use a MatchAllDocsQuery?

At first I thought this was overkill, but a {{Collector}} will allow 
computing them for documents that match another query. I will explore that 
option.

bq. Why is missing undefined when count is zero?

I thought that if you have no documents in the index at all, then {{missing}} 
is undefined, but now that you ask the question, I guess in that case it's fine 
if it's {{0}}, like {{count}}. I'll change the docs.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-12 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-12 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

Thanks [~mikemccand] and [~thetaphi], I changed to a static class and removed 
{{DocsAndContexts}} in favor of a new 
{{Function<LeafReaderContext,DocIdSetIterator>}}.

Maybe {{BitsDocIdSetIterator}} can go in separately (i.e. as a separate issue), 
since I think it's a useful utility to have anyway.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>Reporter: Shai Erera
>Assignee: Shai Erera
> Attachments: LUCENE-7590.patch, LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers, that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7590) Add DocValues statistics helpers

2016-12-11 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7590:
---
Attachment: LUCENE-7590.patch

First patch adds numeric statistics. I'd appreciate comments about it before I 
add support for sorted-numeric (including whether we should at all!).

Note that I chose to take either a field or {{ValueSource}}. The latter gives 
some flexibility by allowing users to pass an arbitrary VS over e.g. an 
{{Expression}} over a numeric DV field.
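
For example (a sketch assuming the pre-7.0 expressions API; the field name and 
expression are made up):

{code}
// A ValueSource over an Expression over a numeric DV field, which could
// then be handed to the stats helper instead of a raw field name.
Expression expr = JavascriptCompiler.compile("sqrt(popularity)"); // throws ParseException
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("popularity", SortField.Type.LONG)); // numeric DV field
ValueSource vs = expr.getValueSource(bindings);
{code}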

The {{ValueSource}} route, as far as I could tell, does not apply to 
{{SortedNumericDV}}, or at least I couldn't find an existing {{ValueSource}} 
implementation (like {{LongFieldSource}}) for it.

If this approach looks good, I'd like to refactor the class so that it's easy 
to share/reuse code between Long and Double NDV fields.

> Add DocValues statistics helpers
> 
>
> Key: LUCENE-7590
> URL: https://issues.apache.org/jira/browse/LUCENE-7590
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/misc
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Attachments: LUCENE-7590.patch
>
>
> I think it can be useful to have DocValues statistics helpers, that can allow 
> users to query for the min/max/avg etc. stats of a DV field. In this issue 
> I'd like to cover numeric DV, but there's no reason not to add it to other DV 
> types too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7590) Add DocValues statistics helpers

2016-12-11 Thread Shai Erera (JIRA)
Shai Erera created LUCENE-7590:
--

 Summary: Add DocValues statistics helpers
 Key: LUCENE-7590
 URL: https://issues.apache.org/jira/browse/LUCENE-7590
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/misc
Reporter: Shai Erera
Assignee: Shai Erera


I think it can be useful to have DocValues statistics helpers, that can allow 
users to query for the min/max/avg etc. stats of a DV field. In this issue I'd 
like to cover numeric DV, but there's no reason not to add it to other DV types 
too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Ishan Chattopadhyaya as Lucene/Solr committer

2016-11-29 Thread Shai Erera
Congratulations!

On Wed, Nov 30, 2016, 01:02 Alexandre Rafalovitch 
wrote:

> Congratulations and welcome Ishan.
>
> I am happy to stop being the "new boy" of this distinguished club :-)
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 30 November 2016 at 05:17, Mark Miller  wrote:
> > I'm pleased to announce that Ishan Chattopadhyaya has accepted the PMC's
> > invitation to become a committer.
> >
> > Ishan, it's tradition that you introduce yourself with a brief bio /
> > origin story, explaining how you arrived here.
> >
> > Your handle "ishan" has already added to the “lucene" LDAP group, so
> > you now have commit privileges.
> >
> > Please celebrate this rite of passage, and confirm that the right
> > karma has in fact enabled, by embarking on the challenge of adding
> > yourself to the committers section of the Who We Are page on the
> > website: http://lucene.apache.org/whoweare.html (use the ASF CMS
> > bookmarklet
> > at the bottom of the page here: https://cms.apache.org/#bookmark -
> > more info here http://www.apache.org/dev/cms.html).
> >
> > Congratulations and welcome!
> > --
> > - Mark
> > about.me/markrmiller
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-10 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414871#comment-15414871
 ] 

Shai Erera commented on LUCENE-7344:


bq. I don't understand most of what you're saying

To clarify the problem, both for you but also in the interest of writing a 
detailed plan of the proposed solution: currently when a DBQ is processed, it 
uses the LeafReader *without* the NDV updates, and therefore has no knowledge 
of the updated values. This is relatively easily solved in the patch I 
uploaded, by applying the DV updates before the DBQ is processed. That way, the 
DBQ uses a LeafReader which is already aware of the updates and all works well.

However, there is an order of update operations that occur in IndexWriter. In 
our case it could be a mix of DBQ and NDV updates. So if we apply *all* the 
DV updates before any of the DBQs, we'll get incorrect results where the DBQ 
either deletes a document it shouldn't (see code example above, and also what 
your {{testDeleteFollowedByUpdateOfDeletedValue}} shows), or doesn't delete a 
document that it should.

To properly solve this problem, we need to apply the DV updates and DBQs in the 
order they were received (as opposed to applying them in bulk, as the current 
code does). Meaning, if the order of operations is NDVU1, NDVU2, DBQ1, NDVU3, 
DBQ2, DBQ3, NDVU4, then we need to:
# Apply NDVU1 + NDVU2; this will cause a new LeafReader to be created
# Apply DBQ1; using the already updated LeafReader
# Apply NDVU3; another LeafReader will be created, now reflecting all 3 NDV 
updates
# Apply DBQ2 and DBQ3; using the updated LeafReader from above
# Apply NDVU4; this will cause another LeafReader to be created
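
In {{IndexWriter}} calls, that sequence would look like this (a sketch; 
{{t1..t4}} / {{q1..q3}} stand for arbitrary terms and queries):

{code}
writer.updateNumericDocValue(t1, "val", 1L); // NDVU1
writer.updateNumericDocValue(t2, "val", 2L); // NDVU2
writer.deleteDocuments(q1);                  // DBQ1: must see NDVU1+NDVU2
writer.updateNumericDocValue(t3, "val", 3L); // NDVU3
writer.deleteDocuments(q2);                  // DBQ2: must see NDVU1..NDVU3
writer.deleteDocuments(q3);                  // DBQ3: must see NDVU1..NDVU3
writer.updateNumericDocValue(t4, "val", 4L); // NDVU4
{code}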

The adversarial effect in this case is that we cause 3 LeafReader reopens, each 
time (due to how NDV updates are currently implemented) writing the full DV 
field to a new stack. If you have many documents, this is going to be very 
expensive. Also, if you have a bigger sequence of interleaving updates and 
deletes, it gets worse and worse.

And so here comes the optimization that Mike and I discussed above. Since the 
NDV updates are held in memory until they're applied, we can avoid flushing 
them to disk by creating a LeafReader which reads the original DV field plus 
the in-memory DV updates. Note though: not *all* DV updates, but only the ones 
that are relevant up until this point. So in the case above, that LeafReader 
will view only NDVU1 and NDVU2, and later it will be updated to view NDVU3 as 
well.
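
Here's a rough sketch of such a view, assuming the 6.x random-access 
{{NumericDocValues}} API; the map-based "pending updates" and the class name 
are illustrative stand-ins for the real buffered-updates plumbing in 
{{ReadersAndUpdates}}:

{code}
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

// Overlays on-disk numeric DV values with the in-memory updates that are
// relevant up to this point. Illustrative only.
class UpdatedDVLeafReader extends FilterLeafReader {

  // field -> (docID -> updated value); stand-in for the real buffered updates
  private final Map<String, Map<Integer, Long>> pending;

  UpdatedDVLeafReader(LeafReader in, Map<String, Map<Integer, Long>> pending) {
    super(in);
    this.pending = pending;
  }

  @Override
  public NumericDocValues getNumericDocValues(String field) throws IOException {
    final NumericDocValues onDisk = in.getNumericDocValues(field);
    final Map<Integer, Long> updates = pending.get(field);
    if (onDisk == null || updates == null) {
      return onDisk; // no such field, or no pending updates for it
    }
    return new NumericDocValues() {
      @Override
      public long get(int docID) {
        Long updated = updates.get(docID);
        return updated != null ? updated : onDisk.get(docID);
      }
    };
  }
}
{code}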

This is purely an optimization step and has nothing to do with correctness (of 
course, that optimization is tricky and needs to be implemented correctly!). 
Therefore my plan of attack in this case is:

# Have enough tests that try different cases before any of this is implemented. 
For example, Mike proposed above to have the LeafReader + DV field "view" use 
docIdUpto. I need to check the code again, but I want to make sure that if 
NDVU2, NDVU3 and NDVU4 (with the interleaving DBQs) all affect the *same* 
document, everything still works.
# Implement the less-efficient approach, i.e. flush the DV updates to disk 
before each DBQ is processed. This ensures that we have a proper solution 
implemented, and we leave the optimization to a later step (either literally a 
later commit, or just a different patch or whatever). I think this is 
complicated enough to start with.
# Improve the solution to avoid flushing DV updates between the DBQs, as 
proposed above.

bq. testBiasedMixOfRandomUpdates

I briefly reviewed the test, but not thoroughly (I intend to). However, notice 
that committing (hard/soft ; commit/NRT) completely avoids the problem because 
a commit/NRT already means flushing DV updates. So if that's what this test 
does, I don't think it's going to expose the problem. Perhaps with the 
explanation I wrote above, you can revisit the test and make it fail though.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5944) Support updates of numeric DocValues

2016-08-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413974#comment-15413974
 ] 

Shai Erera commented on SOLR-5944:
--

[~ichattopadhyaya], I've continued the discussion in LUCENE-7344. I am looking 
into fixing the bug, though it's a hairy one...

> Support updates of numeric DocValues
> 
>
> Key: SOLR-5944
> URL: https://issues.apache.org/jira/browse/SOLR-5944
> Project: Solr
>  Issue Type: New Feature
>Reporter: Ishan Chattopadhyaya
>Assignee: Shalin Shekhar Mangar
> Attachments: DUP.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, 
> TestStressInPlaceUpdates.eb044ac71.beast-167-failure.stdout.txt, 
> TestStressInPlaceUpdates.eb044ac71.beast-587-failure.stdout.txt, 
> TestStressInPlaceUpdates.eb044ac71.failures.tar.gz, 
> hoss.62D328FA1DEA57FD.fail.txt, hoss.62D328FA1DEA57FD.fail2.txt, 
> hoss.62D328FA1DEA57FD.fail3.txt, hoss.D768DD9443A98DC.fail.txt, 
> hoss.D768DD9443A98DC.pass.txt
>
>
> LUCENE-5189 introduced support for updates to numeric docvalues. It would be 
> really nice to have Solr support this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413364#comment-15413364
 ] 

Shai Erera commented on LUCENE-7344:


After chatting with Mike about this, here's an example of an "interleaving" 
case that Mike mentioned, where this patch does not work:

{code}
writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 17L);
writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, true, true));
writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 7L);
{code}

Here, "doc-1" should not be deleted, because the DBQ is submitted before the DV 
update, but because we resolve all DV updates before DBQ (in this patch), it 
ends up deleted. This is wrong of course. I'm looking into Mike's other idea of 
having a LeafReader view with the DV updates up until that document, and then 
ensuring DV updates / DBQs are applied in the order they were submitted. This 
starts to get very complicated.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-7344:
---
Attachment: LUCENE-7344.patch

Patch applies the DBQ after resolving the DV updates. With this patch, if there 
were DV updates, then the SegmentState is updated with the up-to-date reader.

Note that this does not mean more work compared to what was done before -- if 
there are no DV updates, writeFieldUpdates isn't called, and no reader is 
updated. If there were field updates, then writeFieldUpdates was called anyway, 
refreshing the internal reader.

This patch does not change the behavior, except it also updates the 
SegmentState.reader if there were DV updates.

[~mikemccand] what do you think? Our SegmentReader already only refreshes the 
DV updates, that is it already maintains a view of the bare segment with the 
modified DV fields. Also, given what I wrote above, I don't believe that we're 
making more SR reopens?

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413216#comment-15413216
 ] 

Shai Erera commented on LUCENE-7344:


Hmm ... I had to refresh my memory of the DV updates code and I agree with 
[~mikemccand] that the fix is hairy (which goes hand-in-hand with the hairy 
{{BufferedUpdatesStream}}). The problem is that deleteByQuery uses the existing 
LeafReader, but the DV updates themselves were not yet applied so the reader is 
unaware of the change.

I changed the test to call {{updateDocument}} instead of updating the NDV, and 
the test passes. This is expected because updating a document deletes the old 
one and adds a new document. So when the DBQ is processed, a LeafReader is 
opened on the new segment (with the new document; it has to work that way 
because the new document isn't yet flushed) and the new segment thus has the 
new document with the updated NDV.

I agree this is a bug *only* because updating a document followed by DBQ works 
as expected. The internals of how in-place updates are applied should not 
concern the user.

I wonder if we need to implement a complex merge-sorting approach as 
[~mikemccand] proposes, or if applying the DV updates before processing any 
DBQ would be enough (ignoring the adversarial effects that Mike describes; 
they're true, but I'm ignoring them for the moment). I want to try that.

If that works, then perhaps we can detect if a DBQ involves an NDV field (or 
BDV field for that matter) and refresh the reader only then, or refresh the 
reader whenever there are DBQ and any DV updates, even if they are unrelated. 
But first I want to try and make the test pass, before we decide on how to 
properly fix it.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Alexandre Rafalovitch as a Lucene/Solr committer!

2016-08-07 Thread Shai Erera
Welcome Alexandre and congratulations!

On Sun, Aug 7, 2016 at 9:51 PM Dawid Weiss  wrote:

> Welcome Alexandre!
>
> Dawid
>
> On Sun, Aug 7, 2016 at 7:16 PM, Alan Woodward  wrote:
> > Welcome Alexandre!
> >
> > Alan Woodward
> > www.flax.co.uk
> >
> >
> > On 7 Aug 2016, at 15:38, Uwe Schindler wrote:
> >
> > Welcome Alexandre!
> >
> > Uwe
> >
> > On 7 August 2016 at 00:49:50 CEST, "Jan Høydahl"
> > wrote:
> >
> > I'm pleased to announce that Alexandre Rafalovitch has accepted the
> >
> > Lucene PMC's invitation to become a committer.
> >
> >
> > Alexandre, it's tradition that you introduce yourself with a brief bio.
> >
> >
> > Your handle "arafalov" is already added to the “lucene" LDAP group, so
> >
> > you now have commit privileges. Please test this by adding yourself to
> >
> > the committers section of the Who We Are page on the website:
> >
> >  (instructions here
> >
> > ).
> >
> >
> > The ASF dev page also has lots of useful links:
> >
> > .
> >
> >
> > Congratulations and welcome!
> >
> >
> > --
> >
> > Jan Høydahl
> >
> > -
> >
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >
> > --
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, 28213 Bremen
> > http://www.thetaphi.de
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Release Solr 5.5.3

2016-07-20 Thread Shai Erera
Right, it affects SSL-enabled Solr more quickly. I do not know if it
eventually affects non-SSL Solr, I guess not ...

Also, it's not about high indexing rates, as it's reproduced with 9
docs/sec, IIRC.

Anyway, it certainly affected several users already, so I'm +1 for the
release.

Shai

On Wed, Jul 20, 2016 at 11:35 PM David Smiley 
wrote:

> Okay.  BTW SOLR-9290 isn't "Just" high indexing rates, but it's for those
> using SSL too -- correct me if I'm wrong.  We don't want to raise alarm
> bells too loudly :-)
>
> On Wed, Jul 20, 2016 at 4:18 PM Anshum Gupta 
> wrote:
>
>> Hi,
>>
>> With SOLR-9290 fixed, I think it calls for a bug fix release as it
>> impacts all users with high indexing rates.
>>
>> If someone else wants to work on the release, I am fine with it else,
>> I'll be happy to be the RM and cut an RC a week from now.
>>
>> --
>> Anshum Gupta
>>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


[jira] [Commented] (SOLR-9319) DELETEREPLICA should accept just count and remove replicas intelligently

2016-07-19 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384351#comment-15384351
 ] 

Shai Erera commented on SOLR-9319:
--

Thanks [~noble.paul]. The issue description is a bit misleading (_should accept 
*just* count_) but thanks for clarifying.

> DELETEREPLICA should accept just count and remove replicas intelligently
> 
>
> Key: SOLR-9319
> URL: https://issues.apache.org/jira/browse/SOLR-9319
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Noble Paul
> Fix For: 6.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9319) DELETEREPLICA should accept just count and remove replicas intelligently

2016-07-19 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384301#comment-15384301
 ] 

Shai Erera commented on SOLR-9319:
--

What does "just count" mean? Will I not be able to delete a specific replica, 
or is this in addition to being able to delete a selected replica? I think that 
having an API like "delete replicas such that only X remain" is fine, but I 
would like to also be able to specify which replica I want to delete (since in 
my case I need to control that).

> DELETEREPLICA should accept just count and remove replicas intelligently
> 
>
> Key: SOLR-9319
> URL: https://issues.apache.org/jira/browse/SOLR-9319
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Noble Paul
> Fix For: 6.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376474#comment-15376474
 ] 

Shai Erera commented on SOLR-9290:
--

bq. Which begs the question: why are there 15 CLOSE_WAIT connections that last 
forever on branch_6x even with this patch?

I think Shalin's patch only adds this monitor thread to {{UpdateShardHandler}}, 
but not to {{HttpShardHandlerFactory}} so these 15 could be from it?

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, index.sh, 
> setup-solr.sh, setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375601#comment-15375601
 ] 

Shai Erera commented on SOLR-9290:
--

Oh I see. So we didn't experience the problem because we run w/ 2 replicas (and 
one shard currently) and with 5.4.1's settings the math for us results in a low 
number of connections. But someone running a larger Solr deployment could 
already hit that problem prior to 5.5. Thanks for the clarification!

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, 
> setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375582#comment-15375582
 ] 

Shai Erera commented on SOLR-9290:
--

Regarding the patch, the monitor looks good. Few comments:

* I prefer that we name it {{IdleConnectionsMonitor}} (w/ 's', plural 
connections). It goes for the class, field and thread name.
* Do you intend to keep all the log statements around?
* Do you think we should make the polling interval (10s) and 
idle-connections-time (50s) configurable? Perhaps through system properties?
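
To illustrate, a minimal sketch of such a monitor over HttpClient's pooling 
connection manager (the class name and the system-property names are 
assumptions, not the patch):

{code}
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Periodically evicts expired and long-idle pooled connections.
class IdleConnectionsMonitor extends Thread {

  private final PoolingHttpClientConnectionManager cm;
  private volatile boolean stopped;

  IdleConnectionsMonitor(PoolingHttpClientConnectionManager cm) {
    super("idleConnectionsMonitor");
    setDaemon(true);
    this.cm = cm;
  }

  @Override
  public void run() {
    // hypothetical property names -- this is the "configurable" question above
    long pollMs = Long.getLong("solr.idleConnectionsPollMs", 10_000);
    long idleSecs = Long.getLong("solr.idleConnectionsTimeSecs", 50);
    try {
      while (!stopped) {
        Thread.sleep(pollMs);
        cm.closeExpiredConnections();
        cm.closeIdleConnections(idleSecs, TimeUnit.SECONDS);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // exiting on shutdown
    }
  }

  void shutdown() {
    stopped = true;
    interrupt();
  }
}
{code}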

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, 
> setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375571#comment-15375571
 ] 

Shai Erera commented on SOLR-9290:
--

bq. Do you have only two replicas? Perhaps the maxConnectionsPerHost limit of 
100 is kicking in?

Yes, we do have only 2 replicas and I get why the CLOSE_WAITs stop at 100. I 
was asking about 5.3.2 -- how could CLOSE_WAITs get high in 5.3.2 when 
maxConnectionsPerHost was the same as in 5.4.1?

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, 
> setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375552#comment-15375552
 ] 

Shai Erera commented on SOLR-9290:
--

bq. I thought that hypothesis holds only after SOLR-8533. Are you saying you 
also saw it on 5.3.2? If so, what are the values that are set for these 
properties there? We definitely do not see the problem with 5.4.1, but we 
didn't test prior versions.

We posted at the same time; I read your answer above. I wonder why we don't see 
the problem with 5.4.1. I mean, we do see CLOSE_WAITs piling up, but they stop 
at ~100 (200 for the leader).

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, 
> setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375548#comment-15375548
 ] 

Shai Erera commented on SOLR-9290:
--

Thanks [~shalinmangar]. Few questions:

bq. Also, I think the reason this wasn't reproducible on master is because 
SOLR-4509 enabled eviction of idle threads by calling 
HttpClientBuilder#evictIdleConnections with a 50 second limit.

Is this something we can apply to 5x/6x too?

bq. This patch adds a monitor thread for the pool created in UpdateShardHandler 
and with this applied

I didn't see the monitor in the latest patch, only the log printouts. Did you 
forget to add it?

bq. There are still a few connections in CLOSE_WAIT at steady state but I 
verified that they belong to a different HttpClient instance in 
HttpShardHandlerFactory and other places.

(1) Can/Should we have a similar monitor for HttpShardHandlerFactory?
(2) Any reason why the two don't share the same HttpClient instance?

bq. This patch applies on 5.3.2
bq. We have a large limit for maxConnections and maxConnectionsPerHost

I thought that hypothesis holds only after SOLR-8533. Are you saying you also 
saw it on 5.3.2? If so, what are the values that are set for these properties 
there? We definitely *do not* see the problem with 5.4.1, but we didn't test 
prior versions.

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, SOLR-9290-debug.patch, 
> setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375500#comment-15375500
 ] 

Shai Erera commented on SOLR-9290:
--

Thanks [~yo...@apache.org], I'll read the issue.

I agree with what you write in general, but we do hit an issue with these 
settings. The fact that it reproduces easily with SSL enabled suggests that the 
issue may not be in Solr code at all, but I wonder if we shouldn't perhaps pick 
smaller default values when SSL is enabled? (Our guess at the moment is that HC 
keeps more connections in the pool when SSL is enabled because they are more 
expensive to initiate, but it's just a guess.)

And maybe the proper solution would be what [~shalinmangar] wrote above -- have 
a bg monitor which closes idle/expired connections. I actually wonder why it 
can't be a property of {{ClientConnectionManager}} that you can set to auto 
close idle/expired connections after a period of time. We can potentially have 
that monitor act only if SSL is enabled (or at least until non-SSL exhibits the 
same problems too).

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375184#comment-15375184
 ] 

Shai Erera commented on SOLR-9290:
--

Also [~markrmil...@gmail.com], for education purposes, if you have a link to a 
discussion about why it may lead to a distributed deadlock, I'd be happy to 
read it.

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375151#comment-15375151
 ] 

Shai Erera commented on SOLR-9290:
--

Thanks [~markrmil...@gmail.com]. In that case, what's your take on the issue at 
hand?

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375082#comment-15375082
 ] 

Shai Erera commented on SOLR-9290:
--

An update -- I've modified our solr.xml (which is basically the vanilla 
solr.xml) with these added props (under the {{solrcloud}} element) and I do not 
see the connections spike anymore:

{noformat}
<int name="maxUpdateConnections">10000</int>
<int name="maxUpdateConnectionsPerHost">100</int>
{noformat}

Those changes were part of SOLR-8533. [~markrmil...@gmail.com], on that issue 
you didn't explain why the defaults need to be set that high. Was there perhaps 
an email thread you can link to which includes more details? I ask because one 
thing I've noticed is that if I query {{solr/admin/info/system}}, the 
{{system.openFileDescriptorCount}} is very high when there are many 
CLOSE_WAITs. Such a change in Solr defaults probably needs to be accompanied by 
an OS-level setting too, no?

I am still running tests with those props set in solr.xml, on top of 5.5.1. 
[~mbjorgan] would you mind testing in your environment too?

[~hoss...@fucit.org], sorry I completely missed your questions. Our solr.xml is 
the vanilla one, we didn't modify anything in it. We did uncomment the SSL 
props in solr.in.sh as the ref guide says, but aside from the key name and 
password, we didn't change any settings.

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9290) TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled

2016-07-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374779#comment-15374779
 ] 

Shai Erera commented on SOLR-9290:
--

bq. Interestingly, the number of connections stuck in CLOSE_WAIT decrease 
during indexing and increase again about 10 or so seconds after the indexing is 
stopped.

I've observed that too and it's not that they decrease, but rather that the 
connections change their state from CLOSE_WAIT to ESTABLISHED, then when 
indexing is done to TIME_WAIT and then finally to CLOSE_WAIT again. I believe 
this aligns with what the HC documentation says -- the connections are not 
necessarily released, but kept in the pool. When you re-index again, they are 
reused and go back to the pool.

bq. However, this commit only increases the limits on how many update 
connections that can be open

That's interesting and might be a temporary workaround for the problem, which I 
intend to test shortly. After 5.4.1 (as part of SOLR-8533), they were both 
modified to 100,000:

{noformat}
-  public static final int DEFAULT_MAXUPDATECONNECTIONS = 10000;
-  public static final int DEFAULT_MAXUPDATECONNECTIONSPERHOST = 100;
+  public static final int DEFAULT_MAXUPDATECONNECTIONS = 100000;
+  public static final int DEFAULT_MAXUPDATECONNECTIONSPERHOST = 100000;
{noformat}

This can explain why we run into trouble with 5.5.1 but not with 5.4.1. Though 
even in 5.4.1 there are a few hundred CLOSE_WAIT connections, with 5.5.1 they 
reach (in our case) the order of 35-40K, at which point Solr became useless, 
not being able to talk to the replica or pretty much anything else.

I see these can be defined in solr.xml, though it's not documented how, so I'm 
going to give it a try and will report back here.

> TCP-connections in CLOSE_WAIT spikes during heavy indexing when SSL is enabled
> --
>
> Key: SOLR-9290
> URL: https://issues.apache.org/jira/browse/SOLR-9290
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.1, 5.5.2
>Reporter: Anshum Gupta
>Priority: Critical
> Attachments: SOLR-9290-debug.patch, setup-solr.sh
>
>
> Heavy indexing on Solr with SSL leads to a lot of connections in CLOSE_WAIT 
> state. 
> At my workplace, we have seen this issue only with 5.5.1 and could not 
> reproduce it with 5.4.1 but from my conversation with Shalin, he knows of 
> users with 5.3.1 running into this issue too. 
> Here's an excerpt from the email [~shaie] sent to the mailing list  (about 
> what we see:
> {quote}
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
> {quote}
> Here's the mail thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c46cc66220a8143dc903fa34e79205...@vp-exc01.dips.local%3E
> Creating this issue so we could track this and have more people comment on 
> what they see. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging

2016-05-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268847#comment-15268847
 ] 

Shai Erera commented on LUCENE-7253:


I thought so, but that still needs to be benchmarked. Maybe [~prog] has an idea 
for an implementation that will keep both efficient? Maybe read time isn't 
affected, but merge is? Maybe if we choose to sparsely encode only fields that 
are very sparse, read time isn't affected? As a first step that can work. The 
point is we shouldn't shoot down an idea before we have code/results to back 
the shooting.

And I agree that if we had iterator-like API it would make a stronger case for 
sparse DV. Maybe though both need not be coupled and one can be done before the 
other.

> Sparse data in doc values and segments merging 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still big problem with Doc Values merges for sparse 
> fields. When we imagine 1 billion documents index it seems it doesn't matter 
> if all documents have value for this field or there is only 1 document with 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem but there are several cases in which one can expect having many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only small 
> subset of all fields. Average document contains 5-7 different numeric values. 
> As you can see data was very sparse in these fields. It turned out that 
> ingestion process was CPU-bound. Most of CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested to reduce the number of sparse fields and replace them 
> with smaller number of denser fields. This helped a lot but complicated 
> fields naming. 
> I am not much familiar with Doc Values source code but I have small 
> suggestion how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take an example 
> of numeric Doc Values. Would it be possible to replace Iterator which 
> "travels" through all documents with Iterator over collection of non empty 
> values? Of course this would require storing object (instead of numeric) 
> which contains value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is PackedLongValues and it wouldn't be 
> good to change it with different class (some kind of list) because this may 
> break DV performance for dense fields. I hope someone can suggest interesting 
> solutions for this problem :).
> It would be great if discussion about sparse Doc Values merge performance can 
> start here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging

2016-05-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268770#comment-15268770
 ] 

Shai Erera commented on LUCENE-7253:


bq. Read some actual literature on column store databases, see how these 
situations are handled.

It would be great if you could recommend some particular references.

bq. I'm not going to argue with you guys here, because your argument is 
pathetic ... After you have educated yourselves, you will look less silly.

I don't get the patronizing tone, really.

--

What if the numeric DV consumer encoded the data differently based on the 
cardinality of the field? Dense fields would be encoded as they are today, and 
low-cardinality ones would encode two parallel arrays of docs and values 
(over-simplifying, I know). We could then benchmark where the 'dense' cutoff 
should be (50%/10%/100 docs) based on the results.

It's hard to overrule an idea without (a) an implementation that we can refer 
to and (b) proof that it helps in some cases and doesn't make other cases 
worse.

[~prog]: maybe you should start playing with the idea, upload some patches, 
perform some benchmarks, etc. Then we'll have more data to discuss and to decide 
whether this is worth pursuing. What do you think?
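
To make the dual-encoding idea concrete, here is a rough sketch; the class, 
the threshold and both encodings are placeholders to be replaced by whatever 
the benchmarks suggest:

{code}
/** Sketch only: choose a per-field representation based on value density. */
final class DensityAwareEncoding {
  static final double DENSE_THRESHOLD = 0.5; // placeholder; tune via benchmarks

  static boolean isDense(int numDocsWithValue, int maxDoc) {
    return (double) numDocsWithValue / maxDoc >= DENSE_THRESHOLD;
  }

  /** Dense case: one slot per document, as today. */
  static long[] encodeDense(int[] docsWithValue, long[] values, int maxDoc) {
    long[] dense = new long[maxDoc]; // holes stay at the "missing" default
    for (int i = 0; i < docsWithValue.length; i++) {
      dense[docsWithValue[i]] = values[i];
    }
    return dense;
  }

  /** Sparse case: two parallel arrays, doc ids and values, nothing for holes. */
  static long[][] encodeSparse(int[] docsWithValue, long[] values) {
    long[] docs = new long[docsWithValue.length];
    for (int i = 0; i < docsWithValue.length; i++) {
      docs[i] = docsWithValue[i];
    }
    return new long[][] { docs, values.clone() };
  }
}
{code}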

> Sparse data in doc values and segments merging 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately, there is still a big problem with Doc Values merges for sparse 
> fields. If we imagine a 1-billion-document index, it seems it doesn't matter 
> whether all documents have a value for a field or only one document has a 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem, but there are several cases in which one can expect many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields, I realized that Doc Values merges are a bottleneck. 
> I had hundreds of different numeric fields. Each document contained only a 
> small subset of all fields; the average document contains 5-7 different 
> numeric values. As you can see, the data was very sparse in these fields. It 
> turned out that the ingestion process was CPU-bound. Most of the CPU time 
> was spent in DocValues-related methods 
> (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing 
> them with a smaller number of denser fields. This helped a lot but 
> complicated field naming.
> I am not very familiar with the Doc Values source code, but I have a small 
> suggestion for improving Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take the example 
> of numeric Doc Values. Would it be possible to replace the Iterator which 
> "travels" through all documents with an Iterator over the collection of 
> non-empty values? Of course, this would require storing an object (instead 
> of a number) which contains the value and the document ID. Such an iterator 
> could significantly improve merge time for sparse Doc Values fields. IMHO 
> this won't cause big overhead for dense structures, but it can be a game 
> changer for sparse structures.
> This is what happens in NumericDocValuesWriter on flush:
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens, during addValue, this loop is executed to fill holes:
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is a PackedLongValues, and it wouldn't 
> be good to replace it with a different class (some kind of list) because this 
> may break DV performance for dense fields. I hope someone can suggest 
> interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance 
> can start here.

[jira] [Commented] (SOLR-9057) CloudSolrClient should be able to work w/o ZK url

2016-05-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268367#comment-15268367
 ] 

Shai Erera commented on SOLR-9057:
--

In the CSC code I see that {{connect()}} is called from several places, one of 
which is {{sendRequest}} and another is {{requestWithRetryOnStaleState}}. And 
in {{connect()}} I see that a watcher is created both by calling {{new 
ZkStateReader(zkHost, zkClientTimeout, zkConnectTimeout)}} and, immediately 
after, by {{zk.createClusterStateWatchersAndUpdate()}}.

I don't reject your statement about my understanding of how CSC works, but 
could you please explain how it does not create a watcher today? Or, if that's 
the case today and this issue is about changing it, what are you proposing to 
change?

If you prefer to wait with answering these questions until you have a patch, 
I'm OK with that too.

> CloudSolrClient should be able to work w/o ZK url
> -
>
> Key: SOLR-9057
> URL: https://issues.apache.org/jira/browse/SOLR-9057
> Project: Solr
>  Issue Type: Bug
>  Components: SolrJ
>Reporter: Noble Paul
>
> It should be possible to pass one or more Solr URLs to SolrJ and it should be 
> able to get started from there. Exposing ZK to users should not be required; 
> it is a security vulnerability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9057) CloudSolrClient should be able to work w/o ZK url

2016-05-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268329#comment-15268329
 ] 

Shai Erera commented on SOLR-9057:
--

I thought that the whole idea of CSC is to use ZkStateReader so that it can 
react to state changes quickly, because ZkStateReader creates a watch on 
the cluster state. If it doesn't use ZkStateReader anymore, will it 
periodically poll CLUSTERSTATUS? Isn't that less efficient, and might it even 
do a lot of redundant CLUSTERSTATUS checks when the cluster state doesn't change?

I always viewed CSC and its use of ZkStateReader as an advantage. I do 
understand though that it currently plays two roles, which I believe you 
propose to separate: (1) understanding the distributed topology of the Solr 
nodes, so that it forwards requests to leaders etc., and (2) getting notified of 
cluster state changes rather than querying for them repeatedly.

I personally think that CSC should continue to use ZkStateReader and be tied to 
it. Users who don't want to be exposed to ZK can use a regular HttpSolrClient. 
True, their requests may have to be forwarded to the right node (which adds an 
extra hop), but perhaps that's not so bad?

Alternatively, you could have CSC take a ClusterStateProvider with two impls: 
one that uses HTTP CLUSTERSTATUS and another that uses ZkStateReader. Then 
users can enjoy the best of both worlds: CSC does the "right" thing and the 
user can choose whether to work w/ the HTTP end-point or the ZK one.
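
A sketch of what that abstraction could look like; none of this is the actual 
SolrJ API, and {{ClusterState}} here is just a stand-in for whatever snapshot 
type the client consumes:

{code}
import java.util.function.Supplier;

final class ClusterState { /* opaque cluster snapshot, for the sketch only */ }

interface ClusterStateProvider {
  ClusterState getClusterState();
}

// Push-based: backed by a ZK watch, so the view is already up to date.
final class ZkClusterStateProvider implements ClusterStateProvider {
  private final Supplier<ClusterState> watchedView;
  ZkClusterStateProvider(Supplier<ClusterState> watchedView) {
    this.watchedView = watchedView;
  }
  @Override public ClusterState getClusterState() {
    return watchedView.get();
  }
}

// Pull-based: each call hits a node's CLUSTERSTATUS endpoint, no ZK exposure.
final class HttpClusterStateProvider implements ClusterStateProvider {
  private final Supplier<ClusterState> statusCall;
  HttpClusterStateProvider(Supplier<ClusterState> statusCall) {
    this.statusCall = statusCall;
  }
  @Override public ClusterState getClusterState() {
    return statusCall.get();
  }
}
{code}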

> CloudSolrClient should be able to work w/o ZK url
> -
>
> Key: SOLR-9057
> URL: https://issues.apache.org/jira/browse/SOLR-9057
> Project: Solr
>  Issue Type: Bug
>  Components: SolrJ
>Reporter: Noble Paul
>
> It should be possible to pass one or more Solr URLs to SolrJ and it should be 
> able to get started from there. Exposing ZK to users should not be required; 
> it is a security vulnerability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-9057) CloudSolrClient should be able to work w/o ZK url

2016-05-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268315#comment-15268315
 ] 

Shai Erera edited comment on SOLR-9057 at 5/3/16 7:49 AM:
--

How will it initiate {{ZkStateReader}} without getting the ZK host? Or do you 
mean it will extract the ZK info from one of the Solr URLs, by submitting a 
call like {{/admin/info/system}}?


was (Author: shaie):
How will it initiate {{ZkStateReader}} without getting the ZK host? Or do you 
mean it will extract the ZK info from one of the Solr URLs, but submitting a 
call like {{/admin/info/system}}?

> CloudSolrClient should be able to work w/o ZK url
> -
>
> Key: SOLR-9057
> URL: https://issues.apache.org/jira/browse/SOLR-9057
> Project: Solr
>  Issue Type: Bug
>  Components: SolrJ
>Reporter: Noble Paul
>
> It should be possible to pass one or more Solr URLs to SolrJ and it should be 
> able to get started from there. Exposing ZK to users should not be required; 
> it is a security vulnerability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9057) CloudSolrClient should be able to work w/o ZK url

2016-05-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268315#comment-15268315
 ] 

Shai Erera commented on SOLR-9057:
--

How will it initiate {{ZkStateReader}} without getting the ZK host? Or do you 
mean it will extract the ZK info from one of the Solr URLs, but submitting a 
call like {{/admin/info/system}}?

> CloudSolrClient should be able to work w/o ZK url
> -
>
> Key: SOLR-9057
> URL: https://issues.apache.org/jira/browse/SOLR-9057
> Project: Solr
>  Issue Type: Bug
>  Components: SolrJ
>Reporter: Noble Paul
>
> It should be possible to pass one or more Solr URLs to SolrJ and it should be 
> able to get started from there. Exposing ZK to users should not be required; 
> it is a security vulnerability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging

2016-05-02 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266764#comment-15266764
 ] 

Shai Erera commented on LUCENE-7253:


To add to the sparsity discussion: when I did the numeric DV updates, I wrote 
(somewhere) that if we could cater to sparse DV fields better, it might also 
improve the numeric DV updates case. Today, when you update a numeric DV field, 
we rewrite the entire DV for that field in the "stacked" DV. This works well if 
you perform many updates before you flush/commit, but if you only update the 
value of one document, that's costly. If we could write just that one update to 
a stack, we could _collapse_ the stacks at read time.

Of course, that _collapsing_ might slow searches down, so the whole idea of 
writing just the updated values needs to be benchmarked before we actually do 
it; I'm not proposing that here. I just wanted to give another (potential) use 
case for sparse DV fields.

And FWIW, I do agree with [~yo...@apache.org] and [~dsmiley] about sparse DV 
not being an abuse case, as I'm seeing them very often too. That's of course 
unless you mean something else by abuse case...
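
To illustrate the read-time _collapse_ with a hypothetical sketch (the names 
and the per-stack map are illustrative; real stacks would be on-disk 
structures):

{code}
import java.util.List;
import java.util.Map;

// Sketch only: resolve a doc's value by consulting the newest update stack
// first and falling back to the base values written at flush time.
final class StackedNumericValues {
  private final long[] baseValues;               // full per-doc values
  private final List<Map<Integer, Long>> stacks; // sparse updates, oldest first

  StackedNumericValues(long[] baseValues, List<Map<Integer, Long>> stacks) {
    this.baseValues = baseValues;
    this.stacks = stacks;
  }

  long get(int docID) {
    for (int i = stacks.size() - 1; i >= 0; i--) { // newest stack wins
      Long updated = stacks.get(i).get(docID);
      if (updated != null) {
        return updated;
      }
    }
    return baseValues[docID];
  }
}
{code}

This per-lookup scan over the stacks is exactly the potential search-time cost 
that would need benchmarking.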

> Sparse data in doc values and segments merging 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately, there is still a big problem with Doc Values merges for sparse 
> fields. If we imagine a 1-billion-document index, it seems it doesn't matter 
> whether all documents have a value for a field or only one document has a 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem, but there are several cases in which one can expect many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields, I realized that Doc Values merges are a bottleneck. 
> I had hundreds of different numeric fields. Each document contained only a 
> small subset of all fields; the average document contains 5-7 different 
> numeric values. As you can see, the data was very sparse in these fields. It 
> turned out that the ingestion process was CPU-bound. Most of the CPU time 
> was spent in DocValues-related methods 
> (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested reducing the number of sparse fields and replacing 
> them with a smaller number of denser fields. This helped a lot but 
> complicated field naming.
> I am not very familiar with the Doc Values source code, but I have a small 
> suggestion for improving Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take the example 
> of numeric Doc Values. Would it be possible to replace the Iterator which 
> "travels" through all documents with an Iterator over the collection of 
> non-empty values? Of course, this would require storing an object (instead 
> of a number) which contains the value and the document ID. Such an iterator 
> could significantly improve merge time for sparse Doc Values fields. IMHO 
> this won't cause big overhead for dense structures, but it can be a game 
> changer for sparse structures.
> This is what happens in NumericDocValuesWriter on flush:
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens, during addValue, this loop is executed to fill holes:
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is a PackedLongValues, and it wouldn't 
> be good to replace it with a different class (some kind of list) because this 
> may break DV performance for dense fields. I hope someone can suggest 
> interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance 
> can start here.

Re: [VOTE] Release Lucene/Solr 5.5.1

2016-05-02 Thread Shai Erera
When I ran the smoke tester for the first time, I encountered this test
failure:

[junit4] Suite: org.apache.solr.security.TestPKIAuthenticationPlugin
[junit4] 2> Creating dataDir:
/tmp/smoke_lucene_5.5.1_c08f17bca0d9cbf516874d13d221ab100e5b7d58_3/unpack/solr-5.5.1/solr/build/solr-core/test/J3/temp/solr.security.TestPKIAuthenticationPlugin_4643E7DFA3C28AD5-001/init-core-data-001
[junit4] 2> 48028 INFO
(SUITE-TestPKIAuthenticationPlugin-seed#[4643E7DFA3C28AD5]-worker) [ ]
o.a.s.SolrTestCaseJ4 Randomized ssl (true) and clientAuth (false)
[junit4] 2> 48031 INFO
(TEST-TestPKIAuthenticationPlugin.test-seed#[4643E7DFA3C28AD5]) [ ]
o.a.s.SolrTestCaseJ4 ###Starting test
[junit4] 2> 48323 ERROR
(TEST-TestPKIAuthenticationPlugin.test-seed#[4643E7DFA3C28AD5]) [ ]
o.a.s.s.PKIAuthenticationPlugin No SolrAuth header present
[junit4] 2> 48377 ERROR
(TEST-TestPKIAuthenticationPlugin.test-seed#[4643E7DFA3C28AD5]) [ ]
o.a.s.s.PKIAuthenticationPlugin Invalid key
[junit4] 2> 48377 INFO
(TEST-TestPKIAuthenticationPlugin.test-seed#[4643E7DFA3C28AD5]) [ ]
o.a.s.SolrTestCaseJ4 ###Ending test
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestPKIAuthenticationPlugin -Dtests.method=test
-Dtests.seed=4643E7DFA3C28AD5 -Dtests.locale=ja-JP
-Dtests.timezone=Australia/Lindeman -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
[junit4] ERROR 0.35s J3 | TestPKIAuthenticationPlugin.test <<<
[junit4] > Throwable #1: java.lang.NullPointerException
[junit4] > at
__randomizedtesting.SeedInfo.seed([4643E7DFA3C28AD5:CE17D8050D3EE72D]:0)
[junit4] > at
org.apache.solr.security.TestPKIAuthenticationPlugin.test(TestPKIAuthenticationPlugin.java:156)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] 2> 48379 INFO
(SUITE-TestPKIAuthenticationPlugin-seed#[4643E7DFA3C28AD5]-worker) [ ]
o.a.s.SolrTestCaseJ4 ###deleteCore
[junit4] 2> NOTE: leaving temporary files on disk at:
/tmp/smoke_lucene_5.5.1_c08f17bca0d9cbf516874d13d221ab100e5b7d58_3/unpack/solr-5.5.1/solr/build/solr-core/test/J3/temp/solr.security.TestPKIAuthenticationPlugin_4643E7DFA3C28AD5-001
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene54): {},
docValues:{}, sim=DefaultSimilarity, locale=ja-JP,
timezone=Australia/Lindeman
[junit4] 2> NOTE: Linux 4.2.0-30-generic amd64/Oracle Corporation 1.7.0_80
(64-bit)/cpus=8,threads=1,free=161219560,total=432537600
[junit4] 2> NOTE: All tests run in this JVM: [TestAtomicUpdateErrorCases,
TestDefaultStatsCache, TestFiltering, PluginInfoTest,
HdfsWriteToMultipleCollectionsTest, DistributedFacetPivotSmallAdvancedTest,
ConnectionManagerTest, TestJoin, ShardRoutingTest,
WrapperMergePolicyFactoryTest, IndexSchemaRuntimeFieldTest,
TestClassNameShortening, SimpleCollectionCreateDeleteTest,
TestManagedResource, BigEndianAscendingWordDeserializerTest,
HdfsRestartWhileUpdatingTest, TestSolrDeletionPolicy1, TestConfigReload,
TestSolrJ, TestIndexingPerformance, TestInitQParser,
AlternateDirectoryTest, TestConfigOverlay, TestCSVResponseWriter,
SpatialRPTFieldTypeTest, SolrIndexSplitterTest, DistributedVersionInfoTest,
TestSmileRequest, TestPKIAuthenticationPlugin]

The second time, it passed. I didn't have time to dig into the failure, so I
can't tell whether it should hold up the release. What do you think?

Shai

On Sun, May 1, 2016 at 12:26 AM Anshum Gupta  wrote:

> Please vote for the RC1 release candidate for Lucene/Solr 5.5.1.
>
> Artifacts:
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-5.5.1-RC1-revc08f17bca0d9cbf516874d13d221ab100e5b7d58
>
> Smoke tester:
>
>   python3 -u dev-tools/scripts/smokeTestRelease.py
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-5.5.1-RC1-revc08f17bca0d9cbf516874d13d221ab100e5b7d58
>
>
> Here's my +1:
>
> SUCCESS! [0:26:44.452268]
>
>
> --
> Anshum Gupta
>


[jira] [Commented] (SOLR-9016) SolrIdentifierValidator accepts empty names

2016-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259689#comment-15259689
 ] 

Shai Erera commented on SOLR-9016:
--

Thanks [~anshumg] for doing all the backports!

> SolrIdentifierValidator accepts empty names
> ---
>
> Key: SOLR-9016
> URL: https://issues.apache.org/jira/browse/SOLR-9016
> Project: Solr
>  Issue Type: Bug
>  Components: Server
>    Reporter: Shai Erera
> Fix For: 5.5.1, 6.1, 6.0.1
>
> Attachments: SOLR-9016.patch
>
>
> SolrIdentifierValidator accepts shard, collection, core and alias names 
> following this pattern:
> {code}
> ^(?!\\-)[\\._A-Za-z0-9\\-]*$
> {code}
> This accepts an "empty" name. This is easily fixable by changing the {{\*}} 
> to {{+}}. However, it also accepts names such as {{..}}, {{,__---}} etc. Do 
> we not want to require collection names to have a letter/digit identifier in 
> them? Something like the following pattern:
> {code}
> ^(\\.)?[a-zA-Z0-9]+[\\._\\-a-zA-Z0-9]*$
> {code}
> That pattern requires the name to start with an optional {{.}} followed by a 
> series of letters/digits followed by the rest of the allowed characters.
> What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9016) SolrIdentifierValidator accepts empty names

2016-04-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258170#comment-15258170
 ] 

Shai Erera commented on SOLR-9016:
--

All tests pass, so if there are no objections, I'd like to push this change so 
that it makes it into 5.5.1 as well.

> SolrIdentifierValidator accepts empty names
> ---
>
> Key: SOLR-9016
> URL: https://issues.apache.org/jira/browse/SOLR-9016
> Project: Solr
>  Issue Type: Bug
>  Components: Server
>    Reporter: Shai Erera
> Attachments: SOLR-9016.patch
>
>
> SolrIdentifierValidator accepts shard, collection, core and alias names 
> following this pattern:
> {code}
> ^(?!\\-)[\\._A-Za-z0-9\\-]*$
> {code}
> This accepts an "empty" name. This is easily fixable by changing the {{\*}} 
> to {{+}}. However, it also accepts names such as {{..}}, {{,__---}} etc. Do 
> we not want to require collection names to have a letter/digit identifier in 
> them? Something like the following pattern:
> {code}
> ^(\\.)?[a-zA-Z0-9]+[\\._\\-a-zA-Z0-9]*$
> {code}
> That pattern requires the name to start with an optional {{.}} followed by a 
> series of letters/digits followed by the rest of the allowed characters.
> What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-9016) SolrIdentifierValidator accepts empty names

2016-04-26 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-9016:
-
Attachment: SOLR-9016.patch

The patch fixes the regex to not accept empty identifiers; however, it does not 
modify the rule, i.e. someone could still use an identifier like {{\_\_.--}} if 
they want to. I'll be happy to change that, but since I didn't receive any 
feedback, I think this fix is the least we can do (and also push it into 5.5.1).

The patch also modifies the exception message slightly.

> SolrIdentifierValidator accepts empty names
> ---
>
> Key: SOLR-9016
> URL: https://issues.apache.org/jira/browse/SOLR-9016
> Project: Solr
>  Issue Type: Bug
>  Components: Server
>    Reporter: Shai Erera
> Attachments: SOLR-9016.patch
>
>
> SolrIdentifierValidator accepts shard, collection, core and alias names 
> following this pattern:
> {code}
> ^(?!\\-)[\\._A-Za-z0-9\\-]*$
> {code}
> This accepts an "empty" name. This is easily fixable by changing the {{\*}} 
> to {{+}}. However, it also accepts names such as {{..}}, {{,__---}} etc. Do 
> we not want to require collection names to have a letter/digit identifier in 
> them? Something like the following pattern:
> {code}
> ^(\\.)?[a-zA-Z0-9]+[\\._\\-a-zA-Z0-9]*$
> {code}
> That pattern requires the name to start with an optional {{.}} followed by a 
> series of letters/digits followed by the rest of the allowed characters.
> What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-9016) SolrIdentifierValidator accepts empty names

2016-04-20 Thread Shai Erera (JIRA)
Shai Erera created SOLR-9016:


 Summary: SolrIdentifierValidator accepts empty names
 Key: SOLR-9016
 URL: https://issues.apache.org/jira/browse/SOLR-9016
 Project: Solr
  Issue Type: Bug
  Components: Server
Reporter: Shai Erera


SolrIdentifierValidator accepts shard, collection, core and alias names 
following this pattern:

{code}
^(?!\\-)[\\._A-Za-z0-9\\-]*$
{code}

This accepts an "empty" name. This is easily fixable by changing the {{\*}} to 
{{+}}. However, it also accepts names such as {{..}}, {{,__---}} etc. Do we not 
want to require collection names to have a letter/digit identifier in them? 
Something like the following pattern:

{code}
^(\\.)?[a-zA-Z0-9]+[\\._\\-a-zA-Z0-9]*$
{code}

That pattern requires the name to start with an optional {{.}} followed by a 
series of letters/digits followed by the rest of the allowed characters.

What do you think?
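
As a quick sanity check of the proposed pattern against the examples above 
(plain {{java.util.regex}}, no Solr code involved; the class name is ours):

{code}
import java.util.regex.Pattern;

public class IdentifierPatternCheck {
  private static final Pattern PROPOSED =
      Pattern.compile("^(\\.)?[a-zA-Z0-9]+[\\._\\-a-zA-Z0-9]*$");

  public static void main(String[] args) {
    // "" and the letter/digit-free names are rejected; sane names pass.
    for (String name : new String[] {"", "..", "__---", "my_collection", ".system"}) {
      System.out.println('"' + name + "\" -> " + PROPOSED.matcher(name).matches());
    }
  }
}
{code}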



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene/Solr 5.5.1

2016-04-10 Thread Shai Erera
+1, this (SOLR-8642) has bitten us already and we had to revert back to 5.4.x.

Shai

On Sun, Apr 10, 2016 at 8:00 PM Anshum Gupta  wrote:

> Hi,
>
> I would like to release 5.5.1, especially for SOLR-8725.
>
> SOLR-8642 in 5.5 stops people from upgrading to 5.5 and a lot of users
> have spoken about it on the mailing list and the JIRA.
>
> I would like to start the process towards the end of the coming week.
>
>
> --
> Anshum Gupta
>


[jira] [Resolved] (SOLR-8793) Fix stale commit files' size computation in LukeRequestHandler

2016-03-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved SOLR-8793.
--
   Resolution: Fixed
 Assignee: Shai Erera
Fix Version/s: 5.5.1
   master

Pushed the fix to master, branch_6x, branch_6_0, branch_5x and branch_5_5. I 
think it would be good if it's released in a 5.5.1 release.

> Fix stale commit files' size computation in LukeRequestHandler
> --
>
> Key: SOLR-8793
> URL: https://issues.apache.org/jira/browse/SOLR-8793
> Project: Solr
>  Issue Type: Bug
>  Components: Server
>Affects Versions: 5.5
>    Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Minor
> Fix For: master, 5.5.1
>
> Attachments: SOLR-8793.patch
>
>
> SOLR-8587 added segments file information and its size to core admin status 
> API. However, in the case of stale commits, calling that API may result in a 
> {{FileNotFoundException}} or {{NoSuchFileException}} if the segments file no 
> longer exists due to a new commit. We should fix that by returning a proper 
> value for the file's length in this case, maybe -1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8728) Splitting a shard of a collection created with a rule fails with NPE

2016-03-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185223#comment-15185223
 ] 

Shai Erera commented on SOLR-8728:
--

We usually set the fix version to be e.g. "5.5" and "trunk/master".

That's because there are issues that are fixed only in a specific version, 
e.g. if they only affect that version.

> Splitting a shard of a collection created with a rule fails with NPE
> 
>
> Key: SOLR-8728
> URL: https://issues.apache.org/jira/browse/SOLR-8728
> Project: Solr
>  Issue Type: Bug
>Reporter: Shai Erera
>Assignee: Noble Paul
> Fix For: 6.0
>
> Attachments: SOLR-8728.patch, SOLR-8728.patch
>
>
> Spinoff from this discussion: http://markmail.org/message/f7liw4hqaagxo7y2
> I wrote a short test which reproduces, will upload shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-8793) Fix stale commit files' size computation in LukeRequestHandler

2016-03-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-8793:
-
Attachment: SOLR-8793.patch

The patch fixes the bug by catching the {{IOException}} and returning -1. In 
that case, the index info will show a file size of -1 until the reader is 
refreshed.

I chose to return -1 over setting an empty string or not returning the value 
at all, since I feel it's better; but if others think otherwise, please comment.
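
For reference, the shape of the fix is roughly the following 
({{Directory#fileLength}} is the real Lucene call; the wrapper class is just a 
sketch):

{code}
import java.io.IOException;
import org.apache.lucene.store.Directory;

final class SafeFileLength {
  /** Returns the file's length, or -1 if it no longer exists (stale commit). */
  static long fileLength(Directory dir, String fileName) {
    try {
      return dir.fileLength(fileName);
    } catch (IOException e) {
      // FileNotFoundException / NoSuchFileException when a newer commit
      // removed the segments file; report -1 until the reader is refreshed.
      return -1;
    }
  }
}
{code}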

> Fix stale commit files' size computation in LukeRequestHandler
> --
>
> Key: SOLR-8793
> URL: https://issues.apache.org/jira/browse/SOLR-8793
> Project: Solr
>  Issue Type: Bug
>  Components: Server
>Affects Versions: 5.5
>    Reporter: Shai Erera
>Priority: Minor
> Attachments: SOLR-8793.patch
>
>
> SOLR-8587 added segments file information and its size to core admin status 
> API. However, in the case of stale commits, calling that API may result in a 
> {{FileNotFoundException}} or {{NoSuchFileException}} if the segments file no 
> longer exists due to a new commit. We should fix that by returning a proper 
> value for the file's length in this case, maybe -1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8728) Splitting a shard of a collection created with a rule fails with NPE

2016-03-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185200#comment-15185200
 ] 

Shai Erera commented on SOLR-8728:
--

This is marked as fixed in 6.0, but should it also be marked for 6.1 (since 
it's also committed to 6x)?

What about master -- was it not committed to master too? Does it not affect 
master?

And lastly, in case we do release a 5.5.1, is this considered a bugfix that 
we'll want to backport?

> Splitting a shard of a collection created with a rule fails with NPE
> 
>
> Key: SOLR-8728
> URL: https://issues.apache.org/jira/browse/SOLR-8728
> Project: Solr
>  Issue Type: Bug
>    Reporter: Shai Erera
>Assignee: Noble Paul
> Fix For: 6.0
>
> Attachments: SOLR-8728.patch, SOLR-8728.patch
>
>
> Spinoff from this discussion: http://markmail.org/message/f7liw4hqaagxo7y2
> I wrote a short test which reproduces, will upload shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8587) Add segments file information to core admin status

2016-03-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182640#comment-15182640
 ] 

Shai Erera commented on SOLR-8587:
--

OK, yeah, you're right, I was confused. The file can be read by the open IR, 
but it won't appear in the directory listing. I opened SOLR-8793 to fix this, 
sorry about that!

Is there a workaround until the fix is released? Refreshing the searcher, maybe?

> Add segments file information to core admin status
> --
>
> Key: SOLR-8587
> URL: https://issues.apache.org/jira/browse/SOLR-8587
> Project: Solr
>  Issue Type: Improvement
>    Reporter: Shai Erera
>    Assignee: Shai Erera
> Fix For: 5.5, master
>
> Attachments: SOLR-8587.patch, SOLR-8587.patch
>
>
> Having the index's segments file name returned by CoreAdminHandler STATUS can 
> be useful. The info I'm thinking about is the segments file name and its 
> size. If you record that from time to time, then in a crisis, when you need 
> to restore the index and may not be sure which copy you need to restore, this 
> tiny piece of info can be very useful: the segments_N file records the commit 
> point, so comparing what your core reported with what you see at hand can 
> help you make a safer decision.
> I also think it's useful info in general, e.g. probably even more than 
> 'version', and it doesn't add much complexity to the handler or the response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


