Re: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)

2018-01-29 Thread alessandro.benedetti
Taking a look at the Lucene code, this seems to be the closest query to your
requirement:

org.apache.lucene.search.spans.SpanPositionRangeQuery

As far as I know, though, it is not exposed in Solr out of the box.
You could potentially develop a custom query parser that uses it to reach your goal.

Given that, I think the index-time strategy will be much easier: it just
requires a re-index and a few small changes to the query-time configuration.
Another possibility may be to use payloads and the related query parser, but
also in this case you would need to re-index, so it is unlikely that this
option would be your favorite.
I appreciate the fact that you cannot re-index, so in that case you will need
to follow the other approach (developing custom components).

Regards







Re: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)

2018-01-29 Thread alessandro.benedetti
This seems different from what you initially asked (and what Diego responded to):
"One is simple, search query will look for whole content indexed in XYZ field.
Other one is, search query will have to look for first 100 characters indexed
in same XYZ field."

This is still doable at indexing time using a copy field.
You can have your "originalField" and your "truncatedField" with no problem
at all.
Just use a combination of copyField[1] and what Erick suggested.
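
As a minimal sketch (field names are placeholders), the copyField could even
be added programmatically via the Schema API; the truncation itself would then
live in the analysis chain of the destination field:

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddTruncatedCopyField {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            // Copy everything sent to "originalField" into "truncatedField",
            // whose analyzer is expected to cap the indexed content.
            new SchemaRequest.AddCopyField("originalField",
                    Collections.singletonList("truncatedField"))
                    .process(client);
        }
    }
}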

Cheers

[1] https://lucene.apache.org/solr/guide/6_6/copying-fields.html





Re: Solr 4.8.1 multiple client updates the same collection

2018-01-29 Thread alessandro.benedetti
Generally speaking, if a full re-index happens every day, wouldn't it be
better to use a technique such as collection aliasing?

You could point your search clients to an "Alias" which points to the
online collection, "collection1".
When you re-index, you build "collection2"; when it is finished, you point
"Alias" to "collection2".
The following day you do the same thing, but you use "collection1" to index.
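
A minimal SolrJ sketch of the swap (collection and alias names are taken from
above; the ZooKeeper address is a placeholder):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwapAliasExample {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("localhost:2181").build()) {
            // After the nightly re-index into collection2 completes,
            // atomically repoint the alias the search clients use.
            CollectionAdminRequest.createAlias("Alias", "collection2")
                    .process(client);
        }
    }
}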

Client 2, the one doing the atomic updates, will point to "Alias".

I am assuming here that during the re-indexing the prices we get in the fresh
index are the most up to date, so as soon as the re-index finishes the
collection is perfectly up to date.

If you want to update the prices during the re-indexing instead, the price
updater should point to the temporary collection.
Also in this case I assume that if a document has not been indexed yet, the
price update will fail, but the document will get the correct price when it
is indexed.
Please correct any wrong assumptions,

Cheers







Re: Phonetic matching relevance

2018-01-29 Thread alessandro.benedetti
When you say: "However, for the phonetic matches, there are some matches
closer to the query text than others. How can we boost these results?"

Do you mean closer in string edit distance?
If that is the case, you could use the string distance metrics implemented in
Solr with a function query.
From the wiki[1]:

*strdist*
Calculate the distance between two strings. Uses the Lucene spell checker
StringDistance interface and supports all of the implementations available
in that package, plus allows applications to plug in their own via Solr’s
resource loading capabilities. strdist takes (string1, string2, distance
measure).

Possible values for distance measure are:

jw: Jaro-Winkler

edit: Levenstein or Edit distance

ngram: The NGramDistance, if specified, can optionally pass in the ngram
size too. Default is 2.

FQN: Fully Qualified class Name for an implementation of the StringDistance
interface. Must have a no-arg constructor.
e.g.
strdist("SOLR",id,edit)

You can add this to an edismax request using a boost function (the boost
parameter)[2].
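
For example, a hedged sketch wiring strdist into an edismax request as a
multiplicative boost (field names are placeholders):

import org.apache.solr.client.solrj.SolrQuery;

public class StrDistBoostExample {
    public static SolrQuery build(String userQuery) {
        SolrQuery query = new SolrQuery(userQuery);
        query.set("defType", "edismax");
        query.set("qf", "name_phonetic");
        // Multiplicative boost: documents whose "name" value is closer
        // (Jaro-Winkler) to the raw user query are scored higher.
        query.set("boost", "strdist(\"" + userQuery + "\",name,jw)");
        return query;
    }
}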

[1] https://lucene.apache.org/solr/guide/6_6/function-queries.html
[2] https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/





Re: Spellcheck collations results

2018-01-25 Thread alessandro.benedetti
Can you tell us the request parameters used for the spellcheck?

In particular, are you using these? (From the wiki):

" The *spellcheck.maxCollationTries* Parameter
This parameter specifies the number of collation possibilities for Solr to
try before giving up. Lower values ensure better performance. Higher values
may be necessary to find a collation that can return results. The default
value is 0, which maintains backwards-compatible (Solr 1.4) behavior (do not
check collations). This parameter is ignored if spellcheck.collate is false.

The *spellcheck.maxCollationEvaluations* Parameter
This parameter specifies the maximum number of word correction combinations
to rank and evaluate prior to deciding which collation candidates to test
against the index. This is a performance safety-net in case a user enters a
query with many misspelled words. The default is 10,000 combinations, which
should work well in most situations. "
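
For reference, a minimal SolrJ sketch enabling collation verification
(handler name and values are just examples):

import org.apache.solr.client.solrj.SolrQuery;

public class CollationParamsExample {
    public static SolrQuery build(String userQuery) {
        SolrQuery query = new SolrQuery(userQuery);
        query.setRequestHandler("/select");
        query.set("spellcheck", "true");
        query.set("spellcheck.collate", "true");
        // Try up to 5 collations against the index instead of the
        // default 0 (which skips the verification entirely).
        query.set("spellcheck.maxCollationTries", "5");
        query.set("spellcheck.maxCollations", "3");
        return query;
    }
}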

Regards







Re: LTR original score feature

2018-01-24 Thread alessandro.benedetti
This is actually an interesting point.
The original Solr score alone means little; the ranking position of the
document would be a more relevant feature at that stage.

When you put the original score together with the rest of the features
(number of query terms, tf for a specific field, idf for another field, ...),
it may be of potential use, also because some training algorithms group the
training samples by query.

Personally, I am starting to believe it would be better to decompose the
original score into finer-grained features and then rely on LTR to weight
them (as the original score effectively already mixes finer-grained features
following a standard formula).







Re: solr cluster: solr auto suggestion with requestHandler

2018-01-24 Thread alessandro.benedetti
Have you tried adding the "distrib=true" request parameter when building the
suggester?
It should be the default, but setting it explicitly won't harm.

I think nowadays the suggester component is SolrCloud compatible; I have no
chance to test it right now, but it should just work.
Worst case, you can debug a bit and see if anything interesting shows up in
the logs.
Give it a try!
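
A minimal SolrJ sketch of such a build request (handler and dictionary names
are placeholders):

import org.apache.solr.client.solrj.SolrQuery;

public class BuildSuggesterExample {
    public static SolrQuery build() {
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/suggest");
        query.set("suggest", "true");
        query.set("suggest.dictionary", "mySuggester");
        query.set("suggest.build", "true");
        // Ask every shard/replica to build its suggester structures,
        // not just the node that receives the request.
        query.set("distrib", "true");
        return query;
    }
}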

Cheers





Re: Preserve order during indexing

2018-01-24 Thread alessandro.benedetti
Hi Mikhail,
but even if he keeps the docs within a segment, the ordering may be correct
only temporarily, right?
As soon as a segment merge happens (for example after subsequent indexing
sessions or updates), the internal Lucene doc IDs may change, and the default
order on the Solr side may change with them, right?
I am just asking, as I never investigated what happens to the Lucene internal
IDs at merge time.

Following the other comments, I think a more robust approach would be to
explicitly describe the desired order in a field and manage it through Solr
sorting directly.
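
For example, a sketch assuming a hypothetical "insertionOrder" field
populated at indexing time:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;

public class ExplicitOrderExample {
    public static SolrQuery build() {
        // "insertionOrder" is a hypothetical numeric, docValues-enabled
        // field carrying the desired position of each document.
        SolrQuery query = new SolrQuery("*:*");
        query.setSort("insertionOrder", ORDER.asc);
        return query;
    }
}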





Re: LTR and features operating on children doc data

2018-01-24 Thread alessandro.benedetti
I think this has nothing to do with LTR in particular.
Have you tried executing the function query on its own?

I think such a function query doesn't exist at all, right? [1]
So maybe the first step would be to add this nested-children function query
capability to Solr.
I think there is a document transformer in place to return all the children
of the matching parent documents, but unfortunately I don't think there is a
function query at the moment able to do the calculations you want.
It could then be included in LTR (which may not need any change at all).

[1]
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
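
For reference, a sketch of requesting the children transformer mentioned
above, assuming the classic "doc_type:parent" marker field for block-joined
documents:

import org.apache.solr.client.solrj.SolrQuery;

public class ChildTransformerExample {
    public static SolrQuery build(String userQuery) {
        SolrQuery query = new SolrQuery(userQuery);
        // Return each matching parent plus its nested children.
        query.setFields("id", "score", "[child parentFilter=doc_type:parent]");
        return query;
    }
}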





Re: Strange Alias behavior

2018-01-24 Thread alessandro.benedetti
b2b-catalog-material-etl -> b2b-catalog-material 
b2b-catalog-material -> b2b-catalog-material-180117 

and we do a data load to b2b-catalog-material-etl 

We see data being added to both b2b-catalog-material and
b2b-catalog-material-180117 -> *here I assume you wanted to index only into
b2b-catalog-material-180117*

when I delete the alias b2b-catalog-material then the data stopped loading
into the collection b2b-catalog-material-180117 -> *this makes sense: as you
deleted the alias, the data will just go to the actual b2b-catalog-material
collection.*
Why haven't you deleted the old collection instead? What was the purpose of
deleting the alias?

To wrap it up, what is it that you don't like?
Is it this bit: "We see data being added to both b2b-catalog-material and
b2b-catalog-material-180117"?






Re: Using 'learning to rank' with user specific features

2018-01-24 Thread alessandro.benedetti
Hi,
let me see if I got your problem:
your "user specific" features are query-dependent features from the Solr side.
The value of such a feature depends on a query component (the user id) and a
document component (the product id).
You can definitely use them.
You can model this as a binary feature:
1 means the product comes from friends,
0 means it does not.

At training time, you need to provide the value for each training row.
At query time you may need a custom feature type.
You can pass the user id as an EFI (External Feature Information).
In that situation the custom feature will query the external server to get
the friends' products, and then you can calculate the value.
Of course you can implement the custom feature as you wish; that will
strictly depend on how you decide to implement the user-product interaction
tracking and retrieval system.
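
As a sketch, the user id could be passed at query time like this (model name
and efi key are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class LtrEfiExample {
    public static SolrQuery build(String userQuery, String userId) {
        SolrQuery query = new SolrQuery(userQuery);
        // Re-rank the top 100 results with the LTR model; the custom
        // feature reads efi.user_id to resolve the friends' products.
        query.set("rq", "{!ltr model=myModel reRankDocs=100 efi.user_id="
                + userId + "}");
        return query;
    }
}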






RE: Using lucene to post-process Solr query results

2018-01-24 Thread alessandro.benedetti
I have never been a big fan of "getting N results from Solr and then
filtering them client side".
I get your point about the document modelling, so I will assume you properly
tested it and that keeping the small documents on the Solr side is really not
sustainable.

I also appreciate the fact that you want to finally return just the children
documents.

A possible flaw in getting N and then filtering K client side is that you may
end up with 0 results even when there are actual results, e.g.:
you have a total of 1000 results from Solr;
you get the top 10;
you split this top 10 creating 100 children docs, but none of them matches
the query anymore.
In the remaining 990 results there could be valid children documents that are
never returned.

Have you tried nested documents as well, by any chance? (Keep in mind that a
child document is still a Solr document, so it may not be a good fit for
you.)






Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
Thanks Yonik and thanks Doug.

I agree with Doug on adding a few generic test corpora that Jenkins
automatically runs some metrics on, to verify that Apache Lucene/Solr changes
don't affect a golden truth too much.
This can of course be very complex, but I think it is a direction the Apache
Lucene/Solr community should work on.

Given that, I do believe that in this case moving from maxDocs (field
independent) to docCount (field dependent) was a good move (and this specific
multi-language use case is an example).

Actually, I also believe that theoretically docCount (field dependent) would
still be better than a field-dependent maxDocs.
This is because docCount represents a state in time associated with the
current index, while maxDocs represents a historical count.
A corpus of documents can change over time, and how rare a term is can
drastically change (take a highly dynamic domain such as news).

Doug, were you able to generalise and abstract any conclusions from what
happened to your customers, and why they saw regressions moving from maxDocs
to docCount (field dependent)?






Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
"Lucene/Solr doesn't actually delete documents when you delete them, it 
just marks them as deleted.  I'm pretty sure that the difference between 
docCount and maxDoc is deleted documents.  Maybe I don't understand what 
I'm talking about, but that is the best I can come up with. "

Thanks Shawn, yes, that is correct and I was aware of it.
I was curious about another difference:
I think we confirmed that docCount is local to the field (thanks Yonik for
that), so:

docCount(index,field1)= # of documents in the index that currently have
value(s) for field1

My question is :

maxDocs(index,field1)= max # of documents in the index that had value(s) for
field1

OR

maxDocs(index)= max # of documents that appeared in the index ( field
independent)

Regards






Re: Solr score use cases

2017-12-04 Thread alessandro.benedetti
I would like to stress how important what Erick explained is.
A lot of times people want to use the score to show it to the users, to
calculate probabilities, or to do weird calculations with it.

The score is used to rank results, given a query: to give a local ordering.
This ordering is the only useful information for the end user.

From an administrator/developer perspective it is different: debugging the
score can be vital, mainly for relevancy tuning and understanding ranking
bugs.





Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Furthermore, taking a look at the code for BM25 similarity, it seems to me
that it is currently doing the right thing:
docCount is used per field when it is != -1.


  /**
   * Computes a score factor for a simple term and returns an explanation
   * for that score factor.
   *
   * <p>The default implementation uses:
   *
   * <pre class="prettyprint">
   * idf(docFreq, docCount);
   * </pre>
   *
   * Note that {@link CollectionStatistics#docCount()} is used instead of
   * {@link org.apache.lucene.index.IndexReader#numDocs() IndexReader#numDocs()}
   * because also {@link TermStatistics#docFreq()} is used, and when the latter
   * is inaccurate, so is {@link CollectionStatistics#docCount()}, and in the
   * same direction. In addition, {@link CollectionStatistics#docCount()} does
   * not skew when fields are sparse.
   *
   * @param collectionStats collection-level statistics
   * @param termStats term-level statistics for the term
   * @return an Explain object that includes both an idf score factor
   *         and an explanation for the term.
   */
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    final float idf = idf(df, docCount);
    return Explanation.match(idf, "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
        Explanation.match(df, "docFreq"),
        Explanation.match(docCount, "docCount"));
  }





Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Hi Markus,
just out of interest, why did
"It was solved back then by using docCount instead of maxDoc when
calculating idf, it worked really well!" solve the problem?

I assume you are using different fields, one per language, and that each
field appears in a different number of docs, e.g.:
text_en -> 1 docs
text_fr -> 1000 docs
text_it -> 500 docs

Is the reason docCount improved things that docCount is relative to a
specific field, while maxDoc is global over the whole index?









RE: Solr Spellcheck

2017-11-29 Thread alessandro.benedetti
"Can you please suggest suitable configuration for spell check to work
correctly.  I am indexing all the words in one  column. With current
configuration I am not getting good suggestions "

This is very vague.
Spellchecking is working correctly according to your configuration...
Let's start from the beginning: what are your requirements?





RE: Solr Spellcheck

2017-11-28 Thread alessandro.benedetti
Your spellcheck configuration is quite extensive!
In particular you specified:

<float name="maxQueryFrequency">0.01</float>

This means that if a term appears in less than 1% of the total docs, it will
be considered misspelled.
Does cholera occur in more than 1% of the total docs in your corpus?




Inverted Index positions vs Term Vector positions

2017-11-27 Thread alessandro.benedetti
Hi all,
it may sound like a silly question, but is there any reason that the term
positions in the inverted index use 1-based numbering while the Term Vector
positions use 0-based numbering[1]?

This may affect different areas in Solr and cause problems which are quite
tricky to spot.

Regards

[1] http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet





Re: Solr Spellcheck

2017-11-27 Thread alessandro.benedetti
Do you mean you are over-spellchecking?
Correcting even words that are not misspelled?

Can you give us the request handler configuration, the spellcheck
configuration and the schema?

Cheers





Re: Embedded SOLR - Best practice?

2017-11-27 Thread alessandro.benedetti
When you say "caching 100.000 docs", what do you mean?
Being able to quickly find information in a corpus which grows in size
(100.000 docs) every day?

I second Erick: I think this is a fairly normal Solr use case.
If you really care about fast searches, you will get a fairly acceptable
default configuration, and you can then tune the Solr caches if you need to.
Just remember that nowadays Solr is by default optimized for Near Real Time
search and it heavily relies on the memory-mapping features of modern OSs.
This means that Solr is not going to hit the disk all the time: index
portions will be memory mapped (if the memory left to the OS on the machine
is enough).

Furthermore, you may use the heap memory assigned to the Solr JVM to cache
additional elements [1].

In conclusion: I never used the embedded Solr server (apart from integration
tests).

If you really want to play with a scenario where you don't need persistence
on disk, you may play with the RAMDirectory[2], but also in this case I
generally discourage this approach unless you have very specific use cases
and small indexes.

[1]
https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-Caches
[2]
https://lucene.apache.org/solr/guide/6_6/datadir-and-directoryfactory-in-solrconfig.html#DataDirandDirectoryFactoryinSolrConfig-SpecifyingtheDirectoryFactoryForYourIndex
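
That said, if you want to experiment with the embedded server anyway, a
minimal sketch (the solr home path and core name are placeholders; this needs
solr-core on the classpath):

import java.nio.file.Paths;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Runs a core in-process from a standard solr home directory;
        // no HTTP layer is involved.
        try (EmbeddedSolrServer server = new EmbeddedSolrServer(
                Paths.get("/path/to/solr/home"), "mycore")) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            server.add(doc);
            server.commit();
        }
    }
}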





Re: TimeZone issue

2017-11-27 Thread alessandro.benedetti
Hi,
it is on my TO DO list with low priority; there is a Jira issue already[1],
feel free to contribute to it!

[1] https://issues.apache.org/jira/browse/SOLR-8952








Re: Sol rCloud collection design considerations / best practice

2017-11-15 Thread alessandro.benedetti
"The main motivation is to support a geo-specific relevancy 
model which can easily be customized without stepping into each other"

Is your relevancy tuning massively index-time based?
I.e., will it create massively different index content based on the geo
location?

If it is just query-time based, or only lightly index-time based (a few
fields of difference across regions), you don't need different collections at
all to have a customized relevancy model per use case.

In Solr you can define different request handlers with different query parser
and search component specifications.
If you go deep into relevancy tuning and, for example, experiment with
Learning To Rank, it supports passing the model name at query time, which
means you can use a different relevancy model just by passing it as a request
parameter.
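
For example, a sketch of selecting a per-region LTR model at query time (the
model naming scheme is hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class PerRegionModelExample {
    public static SolrQuery build(String userQuery, String region) {
        SolrQuery query = new SolrQuery(userQuery);
        // e.g. "ltr_model_us", "ltr_model_uk": one trained model per geo.
        query.set("rq", "{!ltr model=ltr_model_" + region + " reRankDocs=200}");
        return query;
    }
}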

Regards





Re: Spellcheck returning suggestions for words that exist in the dictionary

2017-11-13 Thread alessandro.benedetti
Which Solr version are you using ?

From the documentation:
"Only query words, which are absent in index or too rare ones (below
maxQueryFrequency ) are considered as misspelled and used for finding
suggestions.
...
These parameters (maxQueryFrequency and thresholdTokenFrequency) can be a
percentage (such as .01, or 1%) or an absolute value (such as 4)."

Checking the latest source code[1]: public static final float
DEFAULT_MAXQUERYFREQUENCY = 0.01f;

This means that for the direct Solr spellchecker, you should not get a
suggestion if the term has a document frequency >= 0.01 (so, roughly, if the
term is in the index).
Can you show us a snippet of the result you got?










Re: Using Ltr and payload together

2017-11-13 Thread alessandro.benedetti
It depends on how you want to use the payloads.

If you want to use the payloads to calculate additional features, you can
implement a payload feature: this feature could calculate the sum of the
numerical payloads for the query terms in each document (so it would be a
query-dependent feature and would leverage the payloads encoded in the index
for the field).

Alternatively, you could use the payloads to affect the original Solr score
before the re-ranking happens (this makes sense only if you use the original
Solr score as a feature).

I recommend this blog about payloads [1].

So, long story short, it depends.

[1] https://lucidworks.com/2017/09/14/solr-payloads/
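
As a quick sketch of the second option, recent Solr versions (6.6+) ship a
payload-aware query parser that can contribute payloads to the score; the
field and term here are placeholders:

import org.apache.solr.client.solrj.SolrQuery;

public class PayloadScoreExample {
    public static SolrQuery build() {
        // Scores each match by the sum of the payloads attached to the
        // term "solr" in the payload-enabled field "terms_dpf".
        return new SolrQuery("{!payload_score f=terms_dpf v=solr func=sum}");
    }
}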





Re: Solr - phrase suggestion returning duplicate

2017-11-10 Thread alessandro.benedetti
"In case you decide to use an entire new index for the 
autosuggestion, you 
can potentially manage that on your own"

This refers to the fact that is possible to define an index just for
autocompletion.
You can model the document as you prefer in this additional index, defining
the field types that best fits you and then managing the documents in the
index ( so you can avoid duplicates according to your rules).

Then you can configure a request handler and manage the query side as your
preference.

Regards





Re: Solr - phrase suggestion returning duplicate

2017-11-08 Thread alessandro.benedetti
Hi Ruby,
I participated in the discussion at the time;
it's definitely still open.

It's on my long TO DO list; I hope I will be able to contribute a solution
sooner or later.
If you decide to use an entirely new index for the autosuggestion, you can
potentially manage that on your own.
But out of the box, you are going to hit that problem.

There is a related issue to solve the problem on the SolrJ client side[1],
but it is not merged into the Solr code either.

[1] https://issues.apache.org/jira/browse/SOLR-8672






Re: Faceting Word Count

2017-11-08 Thread alessandro.benedetti
Apart from the performance, getting a "word cloud" from a subset of documents
is a slightly different problem than getting the facets out of it.

If my understanding is correct, what you want is to extract the "significant
terms" out of your result set.[1]

Using faceting is a rough approximation that may be good enough in your case.
I second the previous comments, and in addition I definitely discourage the
term enum approach if you have millions of terms...

[1] https://issues.apache.org/jira/browse/SOLR-9851





Re: Given path of Ranklib model in Solr Model Json

2017-11-08 Thread alessandro.benedetti
I opened a ticket for RankLib a long time ago to provide support for the Solr
model JSON format[1].

It is on my TO DO list but unfortunately very low priority.
Anyone who wants to contribute is welcome; I will help and commit it when
ready.

Cheers

[1] https://sourceforge.net/p/lemur/feature-requests/144/





Re: Date range queries no longer work 6.6 to 7.1

2017-10-24 Thread alessandro.benedetti
I know it is obvious, but... have you done a full re-index, or did you use
the index migration tool?
In the latter case, it could be a bug in the tool itself.





Re: How to Efficiently Extract Learning to Rank Similarity Features From Solr?

2017-10-24 Thread alessandro.benedetti
I think this can actually be a good idea, and I think it would require a new
feature type implementation.

Specifically, I think you could leverage the existing data structures (such
as the term vectors) to calculate the matrix and then perform the
calculations you need.
Or maybe there is even space for a new optional data structure in the index
to support matrix calculations (it's been a while since I last looked at
codecs and index file formats).





Re: Goal: reverse chronological display Methods? (1) boost, and/or (2) disable idf

2017-10-23 Thread alessandro.benedetti
In addition: bf=recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1) is an additive
boost.
I tend to prefer multiplicative ones, but that is up to you [1].

You can control the order of magnitude of the values generated by that
function, which means you control how much the date affects the score.
If you decide to go additive, be careful with the order of magnitude of the
scores:

your relevancy score magnitude will vary depending on the query and the
index, while your additive boost is going to stay roughly constant.
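
A sketch of the two variants side by side (the field name is taken from your
function; the handler setup is an assumption):

import org.apache.solr.client.solrj.SolrQuery;

public class RecencyBoostExample {
    public static SolrQuery additive(String userQuery) {
        SolrQuery query = new SolrQuery(userQuery);
        query.set("defType", "edismax");
        // Added to the relevancy score: its influence shrinks as scores grow.
        query.set("bf", "recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1)");
        return query;
    }

    public static SolrQuery multiplicative(String userQuery) {
        SolrQuery query = new SolrQuery(userQuery);
        query.set("defType", "edismax");
        // Multiplies the relevancy score: independent of its magnitude.
        query.set("boost", "recip(ms(NOW/DAY,unixdate),3.16e-11,1,1)");
        return query;
    }
}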

Regards


[1] https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/





Re: LTR feature extraction performance issues

2017-10-23 Thread alessandro.benedetti
It strictly depends on the kind of features you are using.
At the moment there is just one cache for all the features.
This means that even if you have 1 query-dependent feature and 100
document-dependent features, a different value for the query-dependent one
will invalidate the cache entry for the full vector[1].

You may look into optimising your features (where possible).

[1]  https://issues.apache.org/jira/browse/SOLR-10448





Re: Facets based on sampling

2017-10-23 Thread alessandro.benedetti
Hi John,
first of all, I may be stating the obvious, but have you tried docValues?

Apart from that, a friend of mine (Diego Ceccarelli) was discussing a
probabilistic implementation similar to HyperLogLog[1] to approximate facet
counting.
I didn't have time to look at the details or implement anything yet,
but it is on our TO DO list :)
He may add some info here.

Cheers




[1]
https://blog.yld.io/2017/04/19/hyperloglog-a-probabilistic-data-structure/





Re: AW: Howto verify that update is "in-place"

2017-10-18 Thread alessandro.benedetti
According to the concept of immutability that drives the Lucene segmenting
approach, I think Emir's observation sounds correct.

DocValues being a column-based data structure stored in the segments, I guess
that when an in-place update happens, effectively only that field gets
re-written. This means we need to write a new segment containing the
information and potentially merge it once it is flushed to disk.





Re: Influencing representing document in grouped search.

2017-10-18 Thread alessandro.benedetti
If you add a filter query to your original query:

fq=genre:A

you know that your results (group heads included) will only be of that genre.
So I think we are not getting your question properly.

Can you try to express your requirement from the beginning?
Leave grouping and field collapsing aside for the moment; let's see what the
best way to solve the requirement in Apache Solr is.





Re: Influencing representing document in grouped search.

2017-10-17 Thread alessandro.benedetti
Could results collapsing[1] be of use to you?
If so, you can use that feature and explore its flexibility in selecting the
group head:

1) min | max for a numeric field
2) min | max for a function query
3) sort

[1]
https://lucene.apache.org/solr/guide/6_6/collapse-and-expand-results.html
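
For example, a hedged sketch (field names are placeholders) keeping, per
group, the document with the highest popularity:

import org.apache.solr.client.solrj.SolrQuery;

public class CollapseExample {
    public static SolrQuery build(String userQuery) {
        SolrQuery query = new SolrQuery(userQuery);
        // One head document per groupId: the one with the max popularity.
        query.addFilterQuery("{!collapse field=groupId max=popularity}");
        return query;
    }
}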

Cheers






Re: Using pint field as uniqueKey

2017-10-17 Thread alessandro.benedetti
In addition to what Amrit correctly stated, if you need to search on your id,
especially with range queries, I recommend using a copy field and leaving the
id field almost as the default.

Cheers





Re: spell-check does not return collations when using search query with filter

2017-10-17 Thread alessandro.benedetti
But you used:

"spellcheck.q": "tag:polt",

instead of:
"spellcheck.q": "polt",

Regards





Re: E-Commerce Search: tf-idf, tie-break and boolean model

2017-10-16 Thread alessandro.benedetti
I was having this discussion with a colleague of mine recently, about
e-commerce search.
Of course there are tons of things you can do to improve relevancy:
custom similarity, edismax tuning, basic user event processing, machine
learning integrations, semantic search, etc.

The more you do, the better the results will potentially be; it is basically
an ocean to explore.
To avoid going off topic and to stay pertinent to your initial request, let's
take a look at the custom similarity problem.

In e-commerce, and generally in proper noun searches, TF is not relevant.
IDF can help, but we need to focus on what IDF is used for in general in
Lucene search:
mostly, IDF is a measure of "how important this term is in the user query".
Basically, Lucene (and TF/IDF-based information retrieval systems in general)
assumes that the rarer a term is in the corpus, the more likely it is to be
important for the search query.
That is not always true in e-commerce:
"iphone cover" means the user is looking for a cover which fits his/her
phone.
"iphone" is rare, "cover" is not, so IDF will consider "iphone" the term most
pertinent to the user intent.
There's a lot to discuss here, let's stop :)

Anyway, as a conclusion, go step by step: a custom similarity plus edismax
optimised with proper phrase and shingle boosts should be a good start.
The tie-breaker default is likely to be OK for e-commerce.
But to verify that, I would recommend setting up a relevancy measuring
framework with golden queries and user feedback.
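
As a sketch of the custom similarity piece, one way to flatten TF (the class
and field usage are an assumption on my side, not a drop-in recipe):

import org.apache.lucene.search.similarities.ClassicSimilarity;

/**
 * A similarity for proper-noun-like fields (product names, brands):
 * a term matching twice should not score higher than matching once.
 */
public class FlatTfSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}

It would then be referenced from the field type in the schema via a
<similarity> element.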

cheers







Re: HOW DO I UNSUBSCRIBE FROM GROUP?

2017-10-16 Thread alessandro.benedetti
The Terms component[1] should do the trick for you.
Just use the regular expression or prefix filtering and you should be able
to get the stats you want.

If instead you are interested in extracting the document frequency when
returning docs, you may be interested in function queries, specifically this
one:

docfreq(field,val)
"Returns the number of documents that contain the term in the field. This is
a constant (the same value for all documents in the index).

You can quote the term if it’s more complex, or do parameter substitution
for the term value.
docfreq(text,'solr')"

...&defType=func&q=docfreq(text,$myterm)&myterm=solr



[1] https://lucene.apache.org/solr/guide/6_6/the-terms-component.html





Re: Strange Behavior When Extracting Features

2017-10-16 Thread alessandro.benedetti
This is interesting: the EFI parameter resolution should work with quotes
independently of the query parser.
At that point, the query parsers (both) receive a multi-term text, and both
of them should behave the same.
When I saw the mail I tried to reproduce the issue through the LTR module
tests and I didn't succeed.
It would be quite useful if you could contribute a test that fails with the
field query parser.
Have you tried just the same query, but in a request handler?





Re: spell-check does not return collations when using search query with filter

2017-10-16 Thread alessandro.benedetti
Interesting: what happens when you pass it as spellcheck.q=polt?
What behavior do you get?







Re: Appending fields to pre-existed document

2017-10-13 Thread alessandro.benedetti
Hi,
"And all what we got 
only a overwriting doc came first by new one. Ok just put overwrite=false 
to params, and dublicating docs appeare."

What exactly is the doc you get?
Are the fields originally in the first doc (before the atomic update) stored?
This is what you need to use:

https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html

If you don't use the atomic update syntax, Solr by default will just
overwrite the entire document.
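
A minimal SolrJ sketch of an atomic update that appends to an existing
document instead of replacing it (field names are placeholders; the fields
involved need to be stored):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            // "add" appends a value; "set" would replace the field instead.
            doc.addField("tags", Collections.singletonMap("add", "new-tag"));
            client.add(doc);
            client.commit();
        }
    }
}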






Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
The only way Solr itself will fetch documents is through the Data Import
Handler.
Take a look at the URLDataSource[1] to see if it fits; possibly you will need
to customize it.

[1]
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#urldatasource





Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
Nabble mutilated my reply :

<!-- If you remove this field, you must _also_ disable the update log in
     solrconfig.xml or Solr won't start. _version_ and update log are
     required for SolrCloud -->
<field name="_version_" type="long" indexed="false" stored="false"/>

<!-- points to the root document of a block of nested documents. Required
     for nested document support, may be removed otherwise -->
<field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>

<!-- Only remove the "id" field if you have a very good reason to. While not
     strictly required, it is highly recommended. A <uniqueKey> is present in
     almost all Solr installations. See the <uniqueKey> declaration below
     where <uniqueKey> is set to "id". -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>






Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
1) "_version_" is not "unecessary", actually the contrary, it is fundamendal
for Solr to work.
The same for types you use across your field definitions.
There was a time you could see these comments in the schema.xml (doesn't
seem the case anymore):

 
   
   
   
   

  


2) https://lucene.apache.org/solr/guide/6_6/schema-api.html : yes, you can.

3) Unless your files are local to the process you will use to push them to
Solr, you will have "two times traffic" independently of the client
technology.

Cheers








RE: Parsing of rq queries in LTR

2017-10-12 Thread alessandro.benedetti
I don't think this is actually that much related to the LTR SolrFeature.
In the Solr feature I see you specify a query with a specific query parser
(field).
Unless there is a bug in the SolrFeature for LTR, I expect the query parser
you defined to be used[1].

This means:

"rawquerystring":"{!field f=full_name}alessandro benedetti",
"querystring":"{!field f=full_name}alessandro benedetti",
"parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
"parsedquery_toString":"full_name:\"alessandro benedetti\"",

In relation to multi-term EFI, you need to pass
efi.example='term1 term2'.
If not, just one term will be passed as EFI.[2]
This is more likely to be your problem.
I don't think the dash is relevant at all.

[1]
https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser
[2] https://issues.apache.org/jira/browse/SOLR-11386






Re: Solr staying constant on popularity indexes

2017-10-10 Thread alessandro.benedetti
Inline:

"1. No zookeeper - I have burnt my hands with some zookeeper issues in the
past and it is no fun to deal with. Kafka and Storm are also trying to
burden zookeeper less and less because ZK cannot handle heavy traffic."

Where did you get this information? Is it based on some public
report/analysis/stress test, or on experience?

"3. Client nodes - No such equivalent in Solr. All nodes do scatter-gather
in Solr which adds scalability problems."

Solr has no such thing, but I would say it is moving in that direction [1],
adding different types of replicas.

Anyway, I agree with you: it is always useful to look for the weak points
(and having another great product for comparison is very useful).

[1]
https://lucene.apache.org/solr/guide/7_0/shards-and-indexing-data-in-solrcloud.html#types-of-replicas





Re: Newbie question about why represent timestamps as "float" values

2017-10-10 Thread alessandro.benedetti
Some time ago there was a Solr installation which had the same problem, and
the author explained to me that the choice was made for performance reasons:
apparently he was sure that handling everything as primitive types would
boost Solr's searching/faceting performance.
I never agreed (one of the reasons being that you need to transform the
floats back into dates to render them in a readable format).

Furthermore, I tend to rely on standing on the shoulders of giants: if a
community (not just a single developer) spent time implementing a date type
(with the different available implementations) to manage specifically date
information, I tend to trust them and believe that the best approach to
manage dates is that ad hoc date type (in its variants, depending on the use
case).

As a plus, using the right data type gives you immense power in debugging and
understanding your data better.
Proper maintenance is another good reason to stick with standards.




Re: Semantic Knowledge Graph

2017-10-09 Thread alessandro.benedetti
I expect the slides to be published here :


https://www.slideshare.net/lucidworks

The one you are looking for is not there yet, but keep an eye on it.

Regards





Re: spell-check does not return collations when using search query with filter

2017-10-09 Thread alessandro.benedetti
Does spellcheck.q=polt help?
How do your queries normally look?
How would you like the collation to be returned?





Re: Rescoring from 0 - full

2017-10-09 Thread alessandro.benedetti
The weights you express could flag a probabilistic view or your final score.
The model you quoted will calculate the final score as :
0.9*scorePersonalId +0.1* originalScore

The final score will NOT necessarily be  0

Re: when transaction logs are closing?

2017-10-09 Thread alessandro.benedetti
In addition to what Emir mentioned, when Solr opens a new transaction log
file it will delete the older ones, subject to some conditions:
keep at least N records [1] and at most K files [2].
N and K are specified in the solrconfig.xml (in the update handler section);
the conditions can be document-related, file-related or both.
So, potentially, it could delete none of them.

This blog from Erick explains it quite well[3].
If you would like to take a look at the code, this class should help[4].



[1] <int name="numRecordsToKeep">${solr.ulog.numRecordsToKeep:100}</int>
[2] <int name="maxNumLogsToKeep">${solr.ulog.maxNumLogsToKeep:10}</int>
[3]
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
[4] org.apache.solr.update.UpdateLog






Re: Solr boost function taking precedence over relevance boosting

2017-10-05 Thread alessandro.benedetti
I would try to use an additive boost and the ^= boost operator:
- name_property :( test^=2 ) will assign a fixed score of 2 if the match
happens ( it is a constant score query)
- additive boost will be 0

Re: length of indexed value

2017-10-04 Thread alessandro.benedetti
Are the norms a good approximation for you ?
If you preserve norms at indexing time ( it is a configuration that you can
operate in the schema.xml) you can retrieve them with this specific function
query :

*norm(field)*
Returns the "norm" stored in the index for the specified field. This is the
product of the index time boost and the length normalization factor,
according to the Similarity for the field.
norm(fieldName)

This will not be the exact length of the field, but it can be a good
approximation though.

Cheers





Re: solr cloud without hard commit?

2017-10-02 Thread alessandro.benedetti
Hi Erick,
you said :
""mentions that for soft commit, "new segments are created that will 
be merged"" 

Wait, how did that get in there? Ignore it, I'm taking it out. "

but I think you were not wrong; based on another mailing list thread message
by Shawn[1], I read:

"If you are using the correct DirectoryFactory type, a soft commit has 
the *possibility* of not writing to disk, but the amount of memory 
reserved is fairly small. 

Looking into the source code for NRTCachingDirectoryFactory, I see that 
maxMergeSizeMB defaults to 4, and maxCachedMB defaults to 48.  This is a 
little bit different than what the javadoc states for 
NRTCachingDirectory (5 and 60): 

http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/store/NRTCachingDirectory.html

The way I read this, assuming the amount of segment data created is 
small, only the first few soft commits will be entirely handled in 
memory.  After that, older segments must be flushed to disk to make room 
for new ones. 

If the indexing rate is high, there's not really much difference between 
soft commits and hard commits.  This also assumes that you have left the 
directory at the default of NRTCachingDirectoryFactory.  If this has 
been changed, then there is no caching in RAM, and soft commit probably 
behaves *exactly* the same as hard commit. 
"

[1]
http://lucene.472066.n3.nabble.com/High-disk-write-usage-td4344356.html#a4344551





Re: Distributed IDF configuration query

2017-10-02 Thread alessandro.benedetti
Hi Reth,
there are some known problems in the debug output for distributed IDF [1].

Your case seems different, though.
It has been a while since I experimented with that feature, but your config
seems OK to me.
What helped me a lot at the time was to debug my Solr instance.



[1] https://issues.apache.org/jira/browse/SOLR-7759





Re: Keeping the index naturally ordered by some field

2017-10-02 Thread alessandro.benedetti
Hi Alex,
just to explore your question a bit: why do you need that?
Do you need to reduce query time?
Have you tried enabling docValues for the fields of interest?
DocValues seem to me a pretty useful data structure when sorting is a
requirement.
I am curious to understand why that was not an option.

Regards





Re: SOLR terminology

2017-09-28 Thread alessandro.benedetti
From the Solr wiki[1]:

*Logical*
/Collection/ : It is a collection of documents which share the same logical
domain and data structure

*Physical*
/Solr Node/ : It is a single instance of a Solr Server. From OS point of
view it is a single Java Process ( internally it is the Solr Web App
deployed in a Jetty Server)
/Solr Core/ : It is a single Index ( with its own configuration) within a
single Solr instance. It is the physical counterpart of a collection( or a
collection shard if the collection is fragmented)
/Solr Cluster /: It is a group of Solr Instances which collaborates through
the supervision of Apache zookeeper instance(s)


[1] https://lucene.apache.org/solr/guide/6_6/how-solrcloud-works.html





Re: Strange Behavior When Extracting Features

2017-09-22 Thread alessandro.benedetti
I think this has nothing to do with the LTR plugin.
The problem here should just be the way you use the local params:
to properly pass multi-term local params in Solr you need to use single
quotes:

efi.case_description='added couple of fiber channel'

This should work.
If not, only the first term will be passed as a local param and then passed
in the efi map to LTR.

I will update the Jira issue as well.

Cheers







Re: Cannot load LTRQParserPlugin inot my core

2017-09-20 Thread alessandro.benedetti
Hi Billy,
there is a README.txt in the contrib/ltr directory.
Reading that, you find this useful link[1].
From that useful link you see where the jar of the plugin is located.
Specifically:

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />
Taking a look at the contrib and dist structure, it seems quite a standard
approach to keep the README in the contrib directory (while in the source
code the contrib modules contain the plugins' code).
The Solr binaries are located in the dist directory.
External libraries are in contrib.


[1]
https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank#LearningToRank-Installation





Re: Solr returning same object in different page

2017-09-19 Thread alessandro.benedetti
Which version of Solr are you on?
Are you using SolrCloud or any distributed search?
In that case, I think (as already mentioned by Shawn) this could be
related[1].

If it is just plain Solr, my shot in the dark is your boost function:

{!boost+b=recip(ms(NOW,field1),3.16e-11,1,1)}{!boost+b=recip(ms(NOW,field2),3.16e-11,1,1)}

I see you use NOW (which changes continuously); it is normally suggested to
round it (for example to NOW/HOUR or NOW/DAY).
The rounding granularity depends on the use case.

Time passing should not bring any change in ranking (but it does change the
score).
I can imagine that if, for any score rounding reason, we end up having
different documents with the same score, then the internal ordinal will be
used to rank them, bringing slightly different rankings.
This is very unlikely, but if you are using a single Solr, it's the first
thing that jumps to my mind.

[1] https://issues.apache.org/jira/browse/SOLR-5821
[2] https://github.com/fguery/lucene-solr/tree/replicaChoice






Re: Search by similarity?

2017-09-19 Thread alessandro.benedetti
In addition to that, I still believe More Like This is a better option for
you.
The reason is that the MLT is able to evaluate the interesting terms from
your document (title is the only field of interest for you), and boost them
accordingly.

Related your "80% of similarity", this is more tricky.
You can potentially calculate the score of the identical document and then
render the score of the similar ones normalised based on that.

Normally it's useless to show the score value per se, but in the case of MLT
it actually make sense to give a percentage score result.
Indeed it could be a good addition to the MLT.

Regards







Re: Knn classifier doesn't work

2017-09-18 Thread alessandro.benedetti
Hi Tommaso,
you are definitely right!
I see that the method MultiFields.getTerms returns:

  if (termsPerLeaf.size() == 0) {
    return null;
  }

As you correctly mentioned, this null is not handled in:

org/apache/lucene/classification/document/SimpleNaiveBayesDocumentClassifier.java:115
org/apache/lucene/classification/document/SimpleNaiveBayesDocumentClassifier.java:228
org/apache/lucene/classification/SimpleNaiveBayesClassifier.java:243

Can you make the change, or should I open a Jira issue and attach the simple
patch for you to commit?
Let me know,

Regards





Re: Apache Solr 4.10.x - Collection Reload times out

2017-09-18 Thread alessandro.benedetti
I finally have an explanation; I post it here for future reference.

The cause was a combination of:

1) the /select request handler has defaults with the spellcheck ON and a few
spellcheck options (such as the collation query ON and max collation tries
set to 5)

2) the firstSearcher has a warm-up query with a lot of terms

Basically, when opening the searcher, I found that there was a thread stuck
in waiting, and that thread was the one responsible for the collation query.
The searcher never finished opening, because of the collation being
calculated over the big multi-term warm-up query.

Lesson learned: be careful with the defaults in the default request handler,
as they may be used by other components (not just user searches).

Thanks for the support!

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr - google like suggestion

2017-09-18 Thread alessandro.benedetti
If you are referring to the number of words per suggestion, you may need to
play with the free text lookup type [1]
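
As a sketch (field name and analyzer field type are illustrative), a free
text suggester is configured in solrconfig.xml like this:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">freeTextSuggester</str>
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="field">title</str>
    <str name="ngrams">2</str>
    <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>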

[1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Copy field a source of copy field

2017-07-26 Thread alessandro.benedetti
I get your point: the second KeepWordFilter is not keeping anything because
the token it gets is "hey you", while the word it is supposed to keep is
"hey".
Which clearly does not match.

The KeepWordFilter just considers each row of the keep-words file a single
token (I may be wrong, I didn't check the code, I am just assuming based on
your observations).
If you want, you can put a WordDelimiterFilter between the two
KeepWordFilters.
Configure the WordDelimiterFilter to split on space (I need to double check,
but it should be possible).

OR
You simply do as Erick suggested, and you just keep the genera in the genus
field.
But as Erick mentioned, you may have problems of entity recognition.







-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Copy-field-a-source-of-copy-field-tp4346425p4347731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: FreeTextSuggester throwing error "token must not contain separator byte"

2017-07-25 Thread alessandro.benedetti
I think this bit is the problem :

"I am using a Shingle filter right after the StandardTokenizer, not sure if 
that has anything to do with it. "

When using the FreeTextLookup approach, you don't need to use shingles in
your analyser, shingles are added by the suggester itself.
As Erick mentioned, the reason spaces come back is because you produce
shingles on your own and then the Lookup approach will add additional
shingles.
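
A minimal sketch of a suggester-friendly field type without shingles (names
are illustrative):

<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>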

I recommend reading this section of my blog [1] (you may have read it
already, as there is one comment describing a similar problem to yours).


[1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/FreeTextSuggester-throwing-error-token-must-not-contain-separator-byte-tp4347406p4347454.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-24 Thread alessandro.benedetti
1) nope, no big tlog or replaying problem

2) Solr just seems frozen. Not responsive and nothing in the log.
Now I tried to restart right after the Zookeeper config deploy, and on
restart the log completely freezes and the instances don't come up...
If I clean the indexes and then start, this works.
Solr is deployed in JBoss, so I don't know if the stop is too aggressive and
breaks something.

3) No problem at all!

I will continue with some analysis.



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4347347.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LambdaMART XML model to JSON

2017-07-24 Thread alessandro.benedetti
hi Ryan,
the issue you mentioned was mine :
https://sourceforge.net/p/lemur/feature-requests/144/

My bad, it got lost in a sea of "To Dos".
I still think it could be a good contribution to the library, but at the
moment going with a custom script/app to do the transformation is the way
to go.





-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/LambdaMART-XML-model-to-JSON-tp4347277p4347343.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Additional information:
Trying single-core reloads, I identified that an entire shard is not
reloading (while the other shard is).
Taking a look at the "not reloading" shard (2 replicas), it seems that the
core reload gets stuck here:

org.apache.solr.core.SolrCores#waitAddPendingCoreOps

The problem is that the wait seems to continue indefinitely and silently.
Apart from a restart, is there any way to clean up the pending core
operations?
I will continue my investigations.




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346966.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Taking a look at the 4.10.2 source, I may see why the async call does not
work:

log.info("Reloading Collection : " + req.getParamString());
String name = req.getParams().required().get("name");

ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION,
    OverseerCollectionProcessor.RELOADCOLLECTION, "name", name);

handleResponse(OverseerCollectionProcessor.RELOADCOLLECTION, m, rsp);

Are we sure we are actually passing the "async" param as a ZkNodeProp?
Because handleResponse does:

private void handleResponse(String operation, ZkNodeProps m,
    SolrQueryResponse rsp, long timeout)
...
if (m.containsKey(ASYNC) && m.get(ASYNC) != null) {

  String asyncId = m.getStr(ASYNC);
...
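
A hypothetical sketch of the kind of fix I would expect (inside the reload
handling, reusing the names above; ASYNC is the "async" parameter key):

// propagate the async id into the ZkNodeProps so that
// handleResponse can actually find it via m.containsKey(ASYNC)
Map<String, Object> props = new HashMap<>();
props.put(Overseer.QUEUE_OPERATION, OverseerCollectionProcessor.RELOADCOLLECTION);
props.put("name", name);
String asyncId = req.getParams().get(ASYNC);
if (asyncId != null) {
  props.put(ASYNC, asyncId);
}
ZkNodeProps m = new ZkNodeProps(props);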



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346949.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread alessandro.benedetti
Assuming the solr service restart does its job, I think the only thing I
would do is to completely remove the data directory content, instead of just
running the delete query.

Bear in mind that when you delete a document in Solr, it is marked as
deleted, but it potentially takes a while until it really leaves the index
(after a successful segment merge).
This could lead to conflicts in the data structures when documents of
different schemas are in the index.
I don't know if it is your case, but I would double check.



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346945.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread alessandro.benedetti
I doubt it is an environment problem at all.
How are you modifying your schema?
How are you reloading your core/collection?
Are you restarting your Solr instance?

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346941.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Thanks for the prompt response Erick,
the reason I am issuing a Collection reload is that from time to time I
modify the solrconfig, for example with different spellcheck and request
parameter defaults.
So after the upload to Zookeeper I reload the collection to reflect the
modification.
Aliasing is definitely a valid option, but at the moment I don't have the
infrastructure set up to operate that programmatically.

Returning to my issue, I see no effect at all if I try to run the request
async (it seems like the parameter is completely ignored):

http://blabla:8983/solr/admin/collections?action=RELOAD&name=news&async=55

I checked the source code and the async param seems to be in the 4.10.2
version, so this is really weird.
I will proceed with my investigations.



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346940.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Get results in multiple orders (multiple boosts)

2017-07-18 Thread alessandro.benedetti
"I have different "sort preferences", so I can't build a index and use for
sorting.Maybe I have to sort by category then by source and by language or
by source, then by category and by date"

I would like to focus on this bit.
It is ok to go for a custom function and sort at query time, but I am
curious to explore why an index time solution should not be ok.
You can have these distinct fields:
source_priority
language_priority
category_priority
etc.

These values can be assigned to the documents at indexing time (using for
example a custom update request processor).
Then at query time you can easily sort on those values in a multi-layered
approach:
sort=source_priority desc, category_priority desc
Of course, if the priority for a source changes quite often, or if it is
user dependent, a query time solution would be preferred.





-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Get-results-in-multiple-orders-multiple-boosts-tp4346304p4346559.html
Sent from the Solr - User mailing list archive at Nabble.com.


Apache Solr 4.10.x - Collection Reload times out

2017-07-14 Thread alessandro.benedetti
I have been recently facing an issue with the Collection Reload in a couple
of Solr Cloud clusters :

1) re-index a collection
2) collection happily working
3) trigger collection reload 
4) reload times out (silently, no message in any of the Solr node logs)
5) no effect on the collection (it still serves queries)

If I restart, the collection doesn't start, as it finds the write.lock in
the index.
Sometimes this even prevents the entire cluster from being restarted (even
if the clusterstate.json actually shows only a few collections down) and
Solr is not reachable.
Of course I can mitigate the problem by cleaning up the indexes and
restarting (avoiding the reload in favor of just restarts in the future),
but this is annoying.

I index through the DIH and I use a DirectSolrSpellChecker .
Should I take a look into Zookeeper ? I tried to check the Overseer queues
and some other checks, not sure the best places to look though in there...

Could this be related ?[1] I don't think so, but I am a bit puzzled...

[1] https://issues.apache.org/jira/browse/SOLR-6246






-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: suggestors on shingles

2017-07-13 Thread alessandro.benedetti
To do what?
If it is a use case, please explain it to us.

If it is just to check that the analysis chain worked correctly, you can
check the schema browser or use Luke.

If you just want to test your analysis chain, you can use the analysis tool
in the Solr admin.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/suggestors-on-shingles-tp4345763p4345836.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do I need to declare TermVectorComponent for best MoreLikeThis results?

2017-07-13 Thread alessandro.benedetti
You don't need the TermVectorComponent at all for MLT.

The reason the term vector is suggested for the fields you are interested in
is just that it speeds up the way the MLT retrieves the "interesting terms"
out of your seed document to build the MLT query.

If you don't have the term vector enabled, the MLT will analyse the content
of the fields on the fly with the configured analysis chain.
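
As a sketch (field name and type are illustrative), enabling term vectors is
a single attribute on the field definition, followed by a re-index:

<field name="description" type="text_general" indexed="true" stored="true" termVectors="true"/>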

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-I-need-to-declare-TermVectorComponent-for-best-MoreLikeThis-results-tp4345646p4345794.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: suggestors on shingles

2017-07-13 Thread alessandro.benedetti
I would recommend this blog post of mine to get a better understanding of
how tokenization and the suggester work together [1].

If you take a look at the FuzzyLookupFactory, you will see that it is one of
the suggesters that return the entire content of the field.

You may be interested in the FreeTextLookupFactory.

Cheers


[1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/suggestors-on-shingles-tp4345763p4345793.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: enable fault-tolerance by default on collection?

2017-07-13 Thread alessandro.benedetti
I would recommend playing with the defaults, appends and invariants [1]
elements of the requestHandler node.
Identify the request handler you want to use in solrconfig.xml and then add
the parameter you want.
You should be able to manage this through your source version control
system.
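
A minimal sketch, assuming the fault-tolerance parameter you want on by
default is shards.tolerant:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards.tolerant">true</str>
  </lst>
</requestHandler>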

Cheers

[1]
https://cwiki.apache.org/confluence/display/solr/RequestHandlers+and+SearchComponents+in+SolrConfig#RequestHandlersandSearchComponentsinSolrConfig-SearchHandlers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/enable-fault-tolerance-by-default-on-collection-tp4345780p4345792.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Collections API Overseer Status

2017-07-12 Thread alessandro.benedetti
+1 
I was trying to understand a collection reload timeout happening lately in a
Solr Cloud cluster, and the Overseer status was hard to decipher.

More human-readable names and some additional documentation would help here.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Collections-API-Overseer-Status-tp4345454p4345567.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: High disk write usage

2017-07-05 Thread alessandro.benedetti
Point 2 was the RAM buffer size:

<!-- ramBufferSizeMB sets the amount of RAM that may be used by Lucene
     indexing for buffering added documents and deletions before they
     are flushed to the Directory.
     maxBufferedDocs sets a limit on the number of documents buffered
     before flushing.
     If both ramBufferSizeMB and maxBufferedDocs is set, then
     Lucene will flush based on whichever limit is hit first. -->
<ramBufferSizeMB>100</ramBufferSizeMB>
<maxBufferedDocs>1000</maxBufferedDocs>




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-disk-write-usage-tp4344356p4344386.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: High disk write usage

2017-07-05 Thread alessandro.benedetti
Is the physical machine dedicated? Is it a dedicated VM on shared metal?
Apart from these operational checks, I will assume the machine is dedicated.

In Solr a write to the disk does not happen only on commit; I can think of
other scenarios:

1) *Transaction log* [1]

2) *Index writer flushes*: when the indexing RAM buffer fills up (see
ramBufferSizeMB and maxBufferedDocs in solrconfig.xml), Lucene flushes
segments to disk even before a commit

3) Spellcheck and SuggestComponent building (this depends on the config, in
case you use them)

4) memory swapping?

5) merges (they are potentially triggered by a segment write or an explicit
optimize call, and they can last a while)

Maybe other edge cases, but I would first check this list!

[1]
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-disk-write-usage-tp4344356p4344383.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Same score for different length matches

2017-07-03 Thread alessandro.benedetti
In addition to what Chris has correctly suggested, I would like to focus on
this sentence:
"I am decently certain that at one point in time it worked in a way that a
higher match length would rank higher"

You mean a match in a longer field would rank higher than a match in a
shorter field?
Is that what you want (because it is counter-intuitive)?

Furthermore, I see that some stemming is applied at query time; is that what
you want?




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Same-score-for-different-length-matches-tp4343660p4343917.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: cursorMark / Deep Paging and SEO

2017-06-30 Thread alessandro.benedetti
Hi Jacques,
this should satisfy your curiosity [1].
The mark is telling you the relative position in the sorted set (and it is
mandatory to use the uniqueKey as a tie breaker).
If you change your index, a query using an old mark should still work (but
may potentially return different documents if their sorting values changed).
I think it fits better in a sort of "infinite scrolling" approach.
If you want to just jump to page N, I think the old-school paging is a
better fit.
This is what I was quickly able to find at the moment, happy to hear more
opinions!
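
As a sketch, the first request passes cursorMark=* with a fully
deterministic sort (uniqueKey as tie breaker); each response returns a
nextCursorMark to send as cursorMark in the following request:

q=*:*&sort=score desc,id asc&rows=10&cursorMark=*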





[1]
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#how-cursors-are-affected-by-index-updates



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/cursorMark-Deep-Paging-and-SEO-tp4343617p4343698.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Suggester and fuzzy/infix suggestions

2017-06-29 Thread alessandro.benedetti
Another path to follow could be to design a specific collection (index) for
the auto-suggestion.
In there you can define the analysis chain as you like (for example using
edge n-gram filtering on top of tokenisation) to provide infix
autocompletion.
Then you can play with your queries as you like and potentially run fuzzy
queries.
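
A minimal sketch of such a field type (names and gram sizes are
illustrative); the edge n-grams are applied at index time only:

<fieldType name="text_suggest_infix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>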

Under the hood the AnalyzingInfixLookupFactory is using an internal
auxiliary Lucene index, so won't be that different.
If you don't want to go with an external index, you could potentially just
add an additional field with the analysis you like in the current
collection.




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggester-and-fuzzy-infix-suggestions-tp4343225p4343382.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR Suggester returns either the full field value or single terms only

2017-06-27 Thread alessandro.benedetti
Hi Angel,
can you give me an example query, a couple of example documents, and the
suggestions you get (which you don't expect)?

The config seems fine (I remember there were some tricky problems with the
default separator, but a space should be fine there).

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-Suggester-returns-either-the-full-field-value-or-single-terms-only-tp4342763p4342987.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR Suggester returns either the full field value or single terms only

2017-06-26 Thread alessandro.benedetti
" Don't use an heavy Analyzers, the suggested terms will come from the index,
so be sure they are meaningful tokens. A really basic analyser is suggested,
stop words and stemming are not "

This means that your suggestions will come from the index, so if you use
heavy analysers you can get terms suggested which are not really useful :

e.g.

Solr is an amazing search engine

If you have some stemmer in your analysis chain, you will have this behavior
:

q= ama
result : amaz search engin

So it is better to have this lookup strategy configured on top of a light
analysed field ( or copyfield).
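
A minimal sketch of that copyField setup (field and type names are
illustrative; text_suggest_light would be a tokenizer + lowercase chain):

<field name="title_suggest" type="text_suggest_light" indexed="true" stored="false"/>
<copyField source="title" dest="title_suggest"/>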





-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-Suggester-returns-either-the-full-field-value-or-single-terms-only-tp4342763p4342807.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR Suggester returns either the full field value or single terms only

2017-06-26 Thread alessandro.benedetti
Hi Angel,
you are looking for the free text lookup approach.
You can find more info in [1] and [2].

[1]
https://lucene.apache.org/solr/guide/6_6/suggester.html#Suggester-FreeTextLookupFactory
[2] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-Suggester-returns-either-the-full-field-value-or-single-terms-only-tp4342763p4342790.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query Partial Matching on auto schema

2017-06-23 Thread alessandro.benedetti
Quoting the official solr documentation : 
" You Can Still Be Explicit
Even if you want to use schemaless mode for most fields, you can still use
the Schema API to pre-emptively create some fields, with explicit types,
before you index documents that use them.

Internally, the Schema API and the Schemaless Update Processors both use the
same Managed Schema functionality."

Even using schemaless, you can use the managed schema API to define your own
field types and fields.
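
As a sketch (collection and field names are illustrative), a field can be
created explicitly through the Schema API before indexing:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {"name": "title", "type": "text_general", "indexed": true, "stored": true}
}' http://localhost:8983/solr/mycollection/schema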

For more info [1]

[1]
https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html#SchemalessMode-EnableManagedSchema



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-Partial-Matching-on-auto-schema-tp4342502p4342509.html
Sent from the Solr - User mailing list archive at Nabble.com.


[Solr Ref guide 6.6] Search not working

2017-06-23 Thread alessandro.benedetti
Hi all,
I was just using the new Solr Ref Guide[1] (If I understood correctly this
is going to be the next official documentation for Solr).

Unfortunately search within the guide works really badly...
The autocomplete seems to be on page titles only (including headings would
help a lot).
If you don't accept any suggestion, it doesn't allow you to search at all
(!!!).
I tried on Safari and Chrome.

For the reference guide of a search engine, it is not nice to have the
search feature in this state.
Actually, being an entry point for developers and users interested in Solr,
it should showcase an amazing and intuitive search and ease the life of
people looking for documentation.
I may be stating the obvious, so concretely: is anybody working to fix this?
Is it because the guide has not been released officially yet?


[1] https://lucene.apache.org/solr/guide/6_6/



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Ref-guide-6-6-Search-not-working-tp4342508.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query Partial Matching on auto schema

2017-06-23 Thread alessandro.benedetti
With automatic schema do you mean schemaless?
You will need to define a schema, managed or old legacy style as you prefer.

Then you define a field type that suits your needs (for example with an edge
n-gram token filter [1]), and you assign that field type to a specific
field.

Then, in your request handler or when you build your query, just use that
field to search.

Regards

[1]
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-EdgeN-GramFilter



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-Partial-Matching-on-auto-schema-tp4342502p4342506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Collection name in result

2017-06-23 Thread alessandro.benedetti
I second Erick,
it would be as easy as adding a stored field to the schema, with a different
default value per collection, something like:

<field name="collection" type="string" indexed="true" stored="true" default="collection1"/>

If you are using inter-collection queries, just be aware there are a lot of
tricky and subtle problems with them (such as the unique identifier needing
the same field name, distributed IDF across collections, etc.).
I am preparing a blog post about that.
I will keep you updated.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Collection-name-in-result-tp4342474p4342501.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mixing distrib=true and false in one request handler?

2017-06-22 Thread alessandro.benedetti
A short answer seems to be no [1].

On the other side, I discussed this in a couple of related Jira issues in
the past, as I (and other people) believe we should always return unique
suggestions anyway [2].

Despite a year having passed, neither I nor others have actually progressed
on that issue :(







[1] org.apache.solr.spelling.suggest.SuggesterParams
[2] https://issues.apache.org/jira/browse/SOLR-8672 and mostly
https://issues.apache.org/jira/browse/LUCENE-6336



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mixing-distrib-true-and-false-in-one-request-handler-tp4342229p4342310.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spatial Search based on the amount of docs, not the distance

2017-06-21 Thread alessandro.benedetti
As with any other search, you can paginate by playing with the 'rows' and
'start' parameters (or cursors if you want to go deep); showing only the
first K results is your responsibility.

Is it not possible in your domain to identify a limit d (beyond which your
results lose meaning)?

You cannot match documents based on the score: first you match, and then
you score.
After you have scored and ranked your results by distance, you can return
the top K as for any other query.

If there are other criteria for you to match the documents, you can just
boost by distance [1] and then return the top K you like.
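
A sketch (field name and coordinates are illustrative): filter within d km
of a point and rank the matches by distance, returning the top K via rows:

q=*:*&sfield=location&pt=45.15,-93.85&d=50&fq={!geofilt}&sort=geodist() asc&rows=10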


[1] https://cwiki.apache.org/confluence/display/solr/Spatial+Search 



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spatial-Search-based-on-the-amount-of-docs-not-the-distance-tp4342108p4342142.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: When to use LTR

2017-06-21 Thread alessandro.benedetti
Hi Ryan,
the first thing to know is that Learning To Rank is about relevancy, and
specifically about improving your relevancy function.
Deciding whether to use LTR has nothing to do with your index size or update
frequency (although LTR brings some performance considerations you will need
to evaluate).

Functionally, the moment you realize you want LTR is when you start tuning
your relevancy.
Normally the first approach is the manual one: you identify a set of
features interesting for your use case, and you tune a boosting function to
improve your search experience.

e.g.
you decide to weight the title field more than the content, and then to
boost recent documents.

What happens next is : 
"How much should I weight more the title ?"
"How much should I boost recent documents ?"

Normally you just check some golden queries and try to optimise these
boosting factors by hand.

LTR answers this requirement.
To put it simply, LTR will bring you a model that tells you the best
weighting factors, given your domain (and past experience), to get the most
relevant results for all queries (this is the ideal; of course it is quite
complicated and depends on a lot of factors).

Of course it doesn't work like magic: you will need to extensively design
your features (feature engineering), build a valid training set (explicit or
implicit), decide on the model that best suits your needs (linear model or
tree-based?) and take care of a lot of corollary configuration.
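
For a concrete idea, a sketch of a linear model in the Solr LTR JSON format
(feature names and weights are illustrative):

{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "myLinearModel",
  "features": [
    { "name": "titleMatch" },
    { "name": "recency" }
  ],
  "params": {
    "weights": {
      "titleMatch": 0.6,
      "recency": 0.4
    }
  }
}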

hope this helps!





-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/When-to-use-LTR-tp4342130p4342140.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Give boost only if entire value is present in Query

2017-06-20 Thread alessandro.benedetti
Interesting.
It seems almost correct to me.
Have you explored the content of the field (for example using the schema
browser)?
When you say "don't match", do you mean you get no results at all, or just
that the boost is not applied?
I would recommend simplifying the request handler, maybe introducing one
piece at a time and verifying you are getting what you want.

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Give-boost-only-if-entire-value-is-present-in-Query-tp4341714p4341951.html
Sent from the Solr - User mailing list archive at Nabble.com.

