Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Tim Casey
People usually want to do some analysis during index time. This analysis should be considered 'expensive', compared to any single query run. You can think of it as indexing every day, over an 86400 second day, vs a 200 ms query time. Normally, you want to index as honestly as possible. That is,

Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
What I have done for this in the past is calculating the expected value of a symbol within a universe. Then calculating the difference between expected value and the actual value at the time you see a symbol. Take the difference and use the most surprising symbols, in rank order from most
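
A minimal sketch of the ranking idea described above, not the poster's actual code. It assumes that the "expected value" of a symbol is its relative frequency in a background sample and the "actual value" is its relative frequency in the text being examined; the function name and the sample data are hypothetical.

```python
from collections import Counter

def surprising_symbols(background, observed, top_n=3):
    """Rank observed symbols by (actual frequency - expected frequency).

    Assumption: expected value is the relative frequency in `background`,
    actual value is the relative frequency in `observed`.
    """
    bg_counts = Counter(background)
    ob_counts = Counter(observed)
    bg_total = sum(bg_counts.values())
    ob_total = sum(ob_counts.values())

    scores = {}
    for symbol, count in ob_counts.items():
        expected = bg_counts.get(symbol, 0) / bg_total if bg_total else 0.0
        actual = count / ob_total
        scores[symbol] = actual - expected

    # Most surprising first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_n]

background = "the cat sat on the mat the dog sat on the log".split()
observed = "the solr index the solr query".split()
print(surprising_symbols(background, observed, top_n=2))
```

Symbols that occur about as often as expected (here "the") score near zero and fall to the bottom of the ranking, which is the behaviour a static stop word list tries to approximate.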

Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
You do not need stop words to do what you need to do. For one thing, stop words require a segmentation on a phrase-by-phrase basis in some cases. That is, especially in places like Europe, there is a lot of mixed language. (Your mileage may vary :). In order to do what you want, you really need

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Tim Casey
Walter, When you do the query, what is the sort of the results? tim On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood wrote: > I’ll back up a bit, since it is sort of an X/Y problem. > > I have an index with four shards and 17 million documents. I want to dump > all the docs in JSON, label

Re: Position search

2019-10-16 Thread Tim Casey
Adi, If you are looking for something specific you might want to try something different. Before you would search 'the end of a document', you might think about segmenting the document and searching specific segments. At the end of a lot of things like email will be signatures. Those are

Re: Position search

2019-10-15 Thread Tim Casey
If this is about a normalized query, I would put the normalization text into a specific field. The reason for this is you may want to search the overall text during any form of expansion phase of searching for data. That is, maybe you want to know the context of up to the 120th word. At least

Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread Tim Casey
https://stackoverflow.com/questions/48348312/solr-7-how-to-do-full-text-search-w-geo-spatial-search On Mon, Sep 30, 2019 at 10:31 AM Anushka Gupta < anushka_gu...@external.mckinsey.com> wrote: > Hi, > > I want to be able to filter on different cities and also sort the results > based on

Re: Encrypting Solr Index

2019-06-25 Thread Tim Casey
My two cents worth of comment, For our local lucene indexes we use AES encryption. We encrypt the blocks on the way out, decrypt on the way in. We are using a C version of lucene, not the java version. But, I suspect the same methodology could be applied. This assumes the data at rest is the

Re: Solr query with long query

2019-05-30 Thread Tim Casey
Venkat, There is another way to do this. If you have a category of "thing" you are attempting to filter over, then you create a query and tag the documents with this category. So, create a 'categories' field and append 'thing' to the field updating the field if need be. (Be wary of over

Re: Help with multi-lang searches

2018-10-22 Thread Tim Casey
Hi Sambhav, Calculate the percentage of letter pairs per language in the index. Given the letter pairs in the incoming token, find the closest "match" for the languages in the indexes. Even on a small number of tokens you will get close to the intended language. You can also calculate the
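
The letter-pair approach above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the poster's implementation: pairs are taken within whitespace-separated words, the "closest match" is the language whose profile gives the highest summed pair fraction, and the tiny sample texts stand in for per-language statistics that would really be gathered from the index. All function names are hypothetical.

```python
from collections import Counter

def letter_pairs(text):
    """Return adjacent letter pairs found inside each word of the text."""
    pairs = []
    for word in text.lower().split():
        letters = "".join(c for c in word if c.isalpha())
        pairs.extend(letters[i:i + 2] for i in range(len(letters) - 1))
    return pairs

def pair_profile(text):
    """Map each letter pair to its fraction of all pairs in the text."""
    counts = Counter(letter_pairs(text))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

def closest_language(token_text, profiles):
    """Pick the language whose profile best covers the token's letter pairs.

    Assumption: the score is the sum of profile fractions over the token's pairs.
    """
    pairs = letter_pairs(token_text)

    def score(profile):
        return sum(profile.get(pair, 0.0) for pair in pairs)

    return max(profiles, key=lambda lang: score(profiles[lang]))

# Toy profiles; real ones would be computed from the indexed text per language.
profiles = {
    "en": pair_profile("the quick brown fox jumps over the lazy dog and then the other thing"),
    "de": pair_profile("der schnelle braune fuchs springt über den faulen hund und dann noch etwas"),
}
print(closest_language("the", profiles))
print(closest_language("schnell", profiles))
```

With larger per-language samples the profiles become far more discriminating, which is consistent with the observation that even a small number of tokens lands close to the intended language.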

Re: solr crypto mining hack...

2018-08-25 Thread Tim Casey
I am not sure how solr is exactly set up currently, much less on any specific system. But, for operations which are largely reading, *maybe* like a query, you might be able to run on a read-only partition. A firewall is a lot less work and a good start, like 90% of the problem. To do this, you

Re: Exact Phrase search not returning results.

2018-07-20 Thread Tim Casey
Deepti, I am going to guess the analyzer part of the .net application is cutting off the last token. If you try the queries on the console of the running solr cluster, what do you get? If you dump that specific field for all the docs, can you find it with grep? tim On Fri, Jul 20, 2018 at

Re: Zookeeper 3.4.12 with Solr 6.6.2?

2018-05-22 Thread Tim Casey
We have 3.4.10 and have *tested* at a functional level 6.6.2. So far it works. We have not done any stress/load testing. But would have to do this prior to release. On Tue, May 22, 2018 at 9:44 AM, Walter Underwood wrote: > Is anybody running Zookeeper 3.4.12 with Solr

Re: Date Query Confusion

2018-05-17 Thread Tim Casey
A simple date range query does not really represent how people query over time and dates. If you want any form of date queries, above a single range, then a special field allowing tokenized query will be the only way to find documents. A query for 'every Tuesday in November of 2017' would have to

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Tim Casey
For smaller length documents TFIDFSimilarity will weight towards shorter documents. Another way to say this, if your documents are 5-10 terms, the 5 terms are going to win. You might think about having per token, or token pair, weight. I would be surprised if there was not something similar out

Re: Howto search for § character

2017-12-07 Thread Tim Casey
At my last company we ended up writing a custom analyzer to handle punctuation. But this was for Lucene 2 or 3. That analyzer was carried forward as we updated and was used for all human derived text. Although now there are way better analyzers and way better ways to hook them up, as noted above

Re: Java profiler?

2017-12-06 Thread Tim Casey
I really like Profiler. It takes a little bit of set up, but it works. tim On Wed, Dec 6, 2017 at 2:04 AM, Peter Sturge wrote: > Hi, > We've been using JProfiler (www.ej-technologies.com) for years now. > Without a doubt, the most comprehensive and useful profiler for

Re: Solr query help

2017-08-18 Thread Tim Casey
You can add a ~3 to the query to allow the order to be reversed, but you will get extra hits. Maybe it is a ~4, I can never remember on phrases and reversals. I usually just try it. Alternatively, you can create a custom query field for what you need from dates. For example, if you want to

Re: Arabic words search in solr

2017-08-02 Thread Tim Casey
There should be a way to use a phrasal query for the specific names. On Wed, Aug 2, 2017 at 2:15 PM, Phil Scadden wrote: > Hopefully changing to default AND solves your problem. If so, I would be > quite interested in what your index config looks like in the end. I also >

Re: Spatial Search based on the amount of docs, not the distance

2017-06-22 Thread Tim Casey
deniz, I was going to add something here. The reason what you want is probably hard to do is because you are asking solr, which stores a document, to return documents using an attribute of document pairs. As only a thought exercise, if you stored record pairs as a single document, you could

Re: model building

2017-03-21 Thread Tim Casey
Joe, To do this correctly, soundly, you will need to sample the data and mark them as threatening or neutral. You can probably expand on this quite a bit, but that would be a good start. You can then draw another set of samples and see how you did. You use one to train and one to validate.

Re: query rewriting

2017-03-07 Thread Tim Casey
Hendrik, I would recommend attempting to stick to the query syntax, as it is in lucene, as close as possible. However, if you do your own query parse build up, you can use a Lucene Query object. I don't know where this bolts into solr, exactly. But I have done this extensively with lucene.

Re: Question about best way to architect a Solr application with many data sources

2017-02-22 Thread Tim Casey
I would possibly extend this a bit further. There is the source, then the 'normalized' version of the data, then the indexed version. Sometimes you realize you miss something in the normalized view and you have to go back to the actual source. This will be as likely as there are number of sources

Re: Chegg is looking for a search engineer

2013-11-18 Thread Tim Casey
I have been chasing the Chegg recruiters. I expect to hear back from Glenn sometime tomorrow. tim On Mon, Nov 18, 2013 at 6:37 PM, Walter Underwood wun...@wunderwood.org wrote: I work at Chegg.com and I really like it, but we have more search work than I can do by myself, so we are hiring a