Re: Tokenizing managed synonyms

2020-07-06 Thread Koji Sekiguchi

I think the question makes sense: SynonymGraphFilterFactory accepts a tokenizerFactory
attribute, and he asked whether the managed version of SynonymGraphFilter could accept it
as well.

https://lucene.apache.org/solr/guide/8_5/filter-descriptions.html#synonym-graph-filter

The answer seems to be NO.
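For reference, on the non-managed filter the attribute is specified like this (a minimal
sketch; the synonyms file name and tokenizer choice are just examples):

  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
          tokenizerFactory="solr.KeywordTokenizerFactory"/>

Note that tokenizerFactory only controls how the rules inside synonyms.txt are tokenized,
not how the field itself is analyzed.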

Koji


On 2020/07/07 8:18, Erick Erickson wrote:

This question doesn’t really make sense. You don’t specify tokenizers on
filters, they’re specified at the _field_ level.

You can certainly define as many field(type)s as you want, each with a different
analysis chain and those chains can be made up of whatever you want to use, and
there are lots of choices.

If you are asking to do _additional_ tokenization on the output of a synonym
filter, no.

Perhaps if you defined the problem you’re trying to solve we could make some
suggestions.

Best,
Erick


On Jul 6, 2020, at 6:43 PM, Thomas Corthals  wrote:

Hi,

Is it possible to specify a Tokenizer Factory on a Managed Synonym Graph
Filter? I would like to use a Standard Tokenizer or Keyword Tokenizer on
some fields.

Best,

Thomas





per field mm

2018-12-14 Thread Koji Sekiguchi

Hi,

I have a use case where one of our customers wants to set a different mm parameter
per field: in some fields of qf, unexpectedly many terms are produced because they
are N-gram fields, while in other fields few terms are produced because they are
normal text fields.

If it is reasonable, I want to add a per-field mm feature. What do you think
about this? And if there is an existing JIRA, let me know.

Thanks,

Koji


Re: Implementing NeuralNetworkModel RankNet in Solr LTR

2018-09-19 Thread Koji Sekiguchi

Hi Edwin,

> Just to check, is this supported in Solr 7.4.0?

Yes, it is.

https://github.com/LTR4L/ltr4l/blob/master/ltr4l-solr/ivy-jars.properties#L17

Koji

On 2018/09/19 19:40, Zheng Lin Edwin Yeo wrote:

Hi Koji,

Thanks for your reply and for providing the information.
Just to check, is this supported in Solr 7.4.0?

Regards,
Edwin

On Wed, 19 Sep 2018 at 11:02, Koji Sekiguchi 
wrote:


Hi,

> https://github.com/airalcorn2/Solr-LTR#RankNet
>
> Has anyone tried this before? And what is the format of the training
> data that this model requires?

I haven't tried it, but I'd like to let you know that there is another
LTR project we've been developing:

https://github.com/LTR4L/ltr4l

It has many LTR algorithms based on neural network, SVM and boosting.

Koji

On 2018/09/12 11:44, Zheng Lin Edwin Yeo wrote:

Hi,

I am working on implementing Solr LTR in Solr 7.4.0, using the
NeuralNetworkModel for feature selection and model training, and I have
found this site which uses RankNet:
https://github.com/airalcorn2/Solr-LTR#RankNet

Has anyone tried this before? And what is the format of the training
data that this model requires?

Regards,
Edwin







Re: Implementing NeuralNetworkModel RankNet in Solr LTR

2018-09-18 Thread Koji Sekiguchi

Hi,

> https://github.com/airalcorn2/Solr-LTR#RankNet
>
> Has anyone tried this before? And what is the format of the training
> data that this model requires?

I haven't tried it, but I'd like to let you know that there is another LTR project
we've been developing:


https://github.com/LTR4L/ltr4l

It has many LTR algorithms based on neural network, SVM and boosting.

Koji

On 2018/09/12 11:44, Zheng Lin Edwin Yeo wrote:

Hi,

I am working on implementing Solr LTR in Solr 7.4.0, using the
NeuralNetworkModel for feature selection and model training, and I have
found this site which uses RankNet:
https://github.com/airalcorn2/Solr-LTR#RankNet

Has anyone tried this before? And what is the format of the training
data that this model requires?

Regards,
Edwin



Re: Return only matched multi-valued field

2017-08-21 Thread Koji Sekiguchi

Hi,

I don't think Lucene/Solr can tell you which field value matched the query you posted.
You would usually use the Highlighter for that.
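A minimal sketch of such a request (collection and field names are placeholders):

  http://localhost:8983/solr/collection1/select?q=mytext:term&hl=true&hl.fl=mytext

The highlighting section of the response then marks up the matched value(s), which is the
closest Solr gets to telling you which entry of the multivalued field matched.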

Koji


On 2017/08/22 2:46, ruby wrote:

Is there a way to return only the matched field from a multivalued field
using filtering?










Re: Issues trying to boost phrase containing stop word

2017-07-19 Thread Koji Sekiguchi

Hi Shamik,

I'm sorry but I don't understand why you use KeywordRepeatFilter.

I think it's normal to create separate fields to solve this kind of problem.
Why don't you have another separate field with ShingleFilter, as I mentioned
in the previous reply?

Koji

On 2017/07/20 12:13, shamik wrote:

Thanks Koji, I've tried KeywordRepeatFilterFactory, which keeps the original
term, but the stopword filter in the analysis chain will remove it
nonetheless. That's why I thought of creating a separate field devoid of
stopwords/stemmers. Let me know if I'm missing something here.






Re: Issues trying to boost phrase containing stop word

2017-07-19 Thread Koji Sekiguchi

Hi Shamik,

How about using ShingleFilter which constructs token n-grams from a token 
stream?

http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html

As for "about dynamic block", ShingleFilter produces "about dynamic" and "dynamic 
block".

Thanks,

Koji

On 2017/07/20 5:54, Shamik Bandopadhyay wrote:

Hi,

   I'm trying to show titles with an exact query phrase match at the top of the
result. That includes supporting stop words as part of the phrase. E.g.,
if I'm searching for "about dynamic block", I expect the title "About
Dynamic Blocks" to appear at the top. Since the title field uses a
stopword filter factory as part of its analysis chain, I decided to create
a copyField of title and use that in search with a higher boost. That
didn't seem to work either. Although it brought back the expected document
at the top, it excluded documents with the title "Dynamic Block Grip
Reference", to be precise, content which doesn't have "about" in the title or
subject. Even setting the default operator to OR didn't make any
difference. Here's the entry from config.

[fieldType and copyField definitions stripped from the archived message]
Request handler:

  <requestHandler name="/browse" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="wt">velocity</str>
      <str name="v.template">browse</str>
      <str name="v.layout">layout</str>
      <str name="title">Solritas</str>
      <str name="q.op">AND</str>
      <str name="defType">edismax</str>
      <str name="qf">title^5 titleExact^15 subject^3 description^2</str>
      <str name="mm">100%</str>
      <str name="q.alt">*:*</str>
      <str name="rows">10</str>
      <str name="fl">*,score</str>
    </lst>
  </requestHandler>

Sample data:


SOLR1000
About Dynamic Blocks
Dynamic blocks contain rules, or parameters, for how
to change the appearance of the block reference when it is inserted in the
drawing. With dynamic blocks you can insert one block that can change
shape, size, or configuration instead of inserting one of many static block
definitions. For example, instead of creating multiple interior door blocks
of different sizes, you can create one resizable door block. You author
dynamic blocks with either constraint parameters or action parameters.
Note: Using both constraint parameters and action parameters in the same
block definition is not recommended. Add Constraints In a block definition,
constraint parameters Associate objects with one another Restrict geometry
or dimensions The following example shows a block reference with a
constraint (in gray) and a constraint parameter (blue with grip). Once the
block is inserted into the drawing, the constraint parameters can be edited
as properties by using the Properties palette. Add Actions and Parameters
In a block definition, actions and parameters provide rules for the
behavior of a block once it is inserted into the drawing. Depending on the
specified block geometry or parameter, you can associate an action to that
parameter. The parameter is represented as a grip in the drawing. When you
edit the grip, the associated action determines what will change in the
block reference. Like constraint parameters, action parameters can be
changed using the Properties palette.
Dynamic blocks contain rules, or parameters, for
how to change the appearance of the block reference when it is inserted in
the drawing.


SOLR1001
About Creating Dynamic Blocks
This table gives an overview of the steps required
add behaviors that make blocks dynamic. Plan the block content. Know how
the block should change or move, and what parts will depend on the others.
Example: The block will be resizable, and after it is resized, additional
geometry is displayed. Draw the geometry. Draw the block geometry in the
drawing area or the Block Editor. Note: If you will use visibility states
to change how geometry is displayed, you may not want to include all the
geometry at this point. Add parameters. Add either individual parameters or
parameter sets to define geometry that will be affected by an action or
manipulation. Keep in mind the objects that will be dependent on one
another. Add actions. If you are working with action parameters, if
necessary, add actions to define what will happen to the geometry when it
is manipulated. Define custom properties. Add properties that determine how
the block is displayed in the drawing area. Custom properties affect grips,
labels, and preset values for block geometry. Test the block. On the
ribbon, in the Block Editor contextual tab, Open/Save panel, click Test
Block to test the block before you save it.
This table gives an overview of the steps
required add behaviors that make blocks dynamic.


SOLR1002
About Modifying Dynamic Block Definitions
Use the Block Editor to edit, correct, and save a
block definition. Correct Errors in Action Parameters A yellow alert icon (
) is displayed when A parameter is not associated with an action An action
is not associated with a parameter or selection set To correct these
errors, hover over the yellow alert icon until the tooltip displays a
description of the problem. Then double-click the constraint and follow the
prompts. Save Dynamic Blocks When you save a block definition, the current
values of the geometry and parameters in the 

Re: Is there any particular reason why ExternalFileField is read from data directory

2017-06-29 Thread Koji Sekiguchi

Hi,

ExternalFileField was introduced via SOLR-351.

https://issues.apache.org/jira/browse/SOLR-351

The author thought the values could optionally be updated often...
I think that describes why it is read not from the config directory, but from the data dir.

Koji


On 2017/06/29 17:17, apoorvqwerty wrote:

Hi,
As per the documentation for ExternalFileField, we need to put the external file
with the map in parallel with the data directory on all the shards.
Is it possible to read the file from a central location or ZooKeeper?






Re: Filtering results by minimum relevancy score

2017-04-12 Thread Koji Sekiguchi

Hi Walter,

May I ask a tangential question? I'm curious about the following line you wrote:

> Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those 
do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results 
is just not as good as vector-space engines. So, probabilistic engines are mostly dead.


Can you elaborate on this?

I thought Okapi BM25, which is the default Similarity in Solr, is based on the
probabilistic model. Did you mean that Lucene/Solr is still based on the vector
space model, but they built BM25Similarity on top of it and therefore
BM25Similarity is not a pure probabilistic scoring system? Or that Okapi BM25
is not originally probabilistic?

As for me, I prefer the idea of vector space to probabilistic for information
retrieval, and I stick with ClassicSimilarity for my projects.

Thanks,

Koji


On 2017/04/13 4:08, Walter Underwood wrote:

Fine. It can’t be done. If it was easy, Solr/Lucene would already have the 
feature, right?

Solr is a vector-space engine. Some early engines (Verity VDK) were 
probabilistic engines. Those do give an absolute estimate of the relevance of 
each hit. Unfortunately, the relevance of results is just not as good as 
vector-space engines. So, probabilistic engines are mostly dead.

But, “you don’t want to do it” is very good advice. Instead of trying to reduce 
bad hits, work on increasing good hits. It is really hard, sometimes not 
possible, to optimize both. Increasing the good hits makes your customers 
happy. Reducing the bad hits makes your UX team happy.

Here is a process. Start collecting the clicks on the search results page (SRP) 
with each query. Look at queries that have below average clickthrough. See if 
those can be combined into categories, then address each category.

Some categories that I have used:

* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all 
valid. Use synonyms or shingles (and maybe the word delimiter filter) to match 
these.

* Misspellings. These should be about 10% of queries. Use fuzzy matching. I 
recommend the patch in SOLR-629.

* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. 
People search for “kids movies”, but your movie genre is “Children and Family”. 
Use synonyms.

* Missing content. People can’t find anything about beach parking because there 
isn’t a page about that. Instead, there are scraps of info about beach parking 
in multiple other pages. Fix the content.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Apr 12, 2017, at 11:44 AM, David Kramer  wrote:

The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be
done” or “here’s how to do it” than “you don’t want to do it”. I’m still left
not knowing if it’s even possible. The one concrete answer of using frange
doesn’t help, as referencing score in either the q or the fq produces an
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:

Can't the filter be used in cases where you're paginating in a
sharded scenario?
So if you do limit=10, offset=10, each shard will return 20 docs,
while if you do limit=10, _score<=last_page.min_score, each shard will
return only 10 docs? (They will still score all docs, but merging will be
faster.)

Makes sense?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti 

Re: Classify document using bag of words

2017-03-26 Thread Koji Sekiguchi

Hi,

I'm not sure whether it can help you, but I'd like to share the link to an
article I wrote about document classification years ago:

Comparing Document Classification Functions of Lucene and Mahout
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

Thanks!

--
koji

On 2017/03/27 1:05, marotosg wrote:

Hi,

I have a very simple use case where I would need to classify a document
using a bag of words. Basically if a field within the document contains any
of the words on my bag then I use a new field to assign a category to the
document.

Is this something achievable on Solr?

I was thinking of using Lucene document classification:
https://wiki.apache.org/solr/SolrClassification

From what I understand, I need to already feed the category on some
documents. New documents would then be classified.

Is there anything else I can't find?

Thanks a lot.








Re: Query/Field Index Analysis corrected but return no docs in search

2017-02-05 Thread Koji Sekiguchi

Hi Peter,

I'm not sure if I can correctly see the result you attached, but it sounds
reasonable to me that you couldn't get a search result, because your query
均匀肤色 is used as-is without being analyzed, whereas the same string 均匀肤色
is tokenized as 均匀 / 匀肤 / 肤色 in the index.

So it is obvious that the tokenizers you're using at indexing and querying time
don't match. Please check which tokenizers you're using in your schema.xml.

Thanks,

koji


On 2017/02/04 23:18, Peter Liu wrote:

hi all:
   I was using Solr 3.6 and tried to solve a recall problem today, but
encountered a weird problem.

   There's a doc with the field value 均匀肤色 (just treat that word as a symbol
if you don't know it; I just want to describe the problem as exactly as possible).


   And below is the analysis result (tokenization), in text form:


  Index Analyzer

均匀肤色 → 均匀 | 匀肤 | 肤色  (the same output at each stage of the chain)


  Query Analyzer

均匀肤色 → 均匀肤色  (the string stays a single token at each stage)



The tokenization result indicates the query should undoubtedly recall/hit the
doc. But the doc did not appear in the result when I searched with "均匀肤色".
I tried simplifying the qf/bf/fq/q, and tested with a single field and a single
document to make sure it wasn't caused by other problems, but failed.

It's knotty to debug because it only reproduces in the production environment;
I tried the same config/index/query but could not reproduce it in the dev
environment. I'm asking for help here in case you have met a similar problem;
any clues or debugging methods would be really helpful.




Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-02-02 Thread Koji Sekiguchi

Hi,

NLP4L[1] has not only a Learning-to-Rank module but also a module which calculates
a click model and converts it into pointwise annotation data.

NLP4L has a comprehensive manual[2], but you may want to read "Click Log 
Analysis"
section[3] first to see if it suits your requirements.

Hope this helps. Thanks!

Koji
--
T: @kojisays

[1] https://github.com/NLP4L/nlp4l
[2] https://github.com/NLP4L/manuals
[3] https://github.com/NLP4L/manuals/blob/master/ltr/ltr_import.md

On 2017/01/05 17:02, Jeffery Yuan wrote:

Thanks very much for integrating machine learning to Solr.
https://github.com/apache/lucene-solr/blob/f62874e47a0c790b9e396f58ef6f14ea04e2280b/solr/contrib/ltr/README.md

In the "Assemble training data" part, the third column indicates the relative
importance or relevance of that doc.
Could you please give more info about how to assign a score based on what the user
clicks?

I have read
https://static.aminer.org/pdf/PDF/000/472/865/optimizing_search_engines_using_clickthrough_data.pdf
http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf
http://alexbenedetti.blogspot.com/2016/07/solr-is-learning-to-rank-better-part-1.html

But I still have no clue how to translate the partial pairwise feedback into the
importance or relevance of that doc.

From a user's perspective, steps such as setting up the feature and model in
Solr are simple, but collecting the feedback data and training/updating the model
is much more complex.

It would be great if Solr could provide some detailed instructions or sample code
about how to translate the partial pairwise feedback and use it to train and
update the model.

Thanks again for your help.









Re: I cannot get phrases highlighted correctly without using the Fast Vector highlighter

2016-09-20 Thread Koji Sekiguchi

Hello Panagiotis,

I'm sorry, but it's a feature. As for the hl.usePhraseHighlighter parameter, when
you turn it off, you may get only <em>foo</em> or <em>bar</em> highlighted in your
snippets.

Koji

On 2016/09/18 15:55, Panagiotis T wrote:

I'm using Solr 6.2 (tried with 6.1 also)

I created a new core and the only change I made is adding the
following line in my schema.xml



I've indexed two simple xml files. Here's a sample:

<add>
  <doc>
    <field name="id">foo bar</field>
    <field name="body_text_en">foo bar</field>
  </doc>
</add>

I'm executing a simple query:
http://localhost:8983/solr/test/select?hl.fl=body_text_en&hl=on&indent=on&q=%22foo%20bar%22&wt=json

And here is the response:

  "response":{"numFound":2,"start":0,"docs":[
  {
"id":"foo bar",
"body_text_en":["foo bar"],
"_version_":1545790848171507712},
  {
"id":"foo bar2",
"body_text_en":["I strongly suspect that foo bar"],
"_version_":1545790848184090624}]
  },
  "highlighting":{
"foo bar":{
  "body_text_en":["foo bar"]},
"foo bar2":{
  "body_text_en":["I strongly suspect that foo bar"]}}}

If I append hl.useFastVectorHighlighter=true to my query the
highlighter correctly highlights the phrase as <em>foo bar</em>. Of
course I've also tried explicitly appending hl.usePhraseHighlighter=true to
my query, but I get the same result. I would like to get the same
result with the standard highlighter if possible.


Regards





Re: Query Elevation

2016-07-11 Thread Koji Sekiguchi

Hello,

I'm curious: why do you want the particular document to be placed second, not top,
in the results for a particular query?

Sorry this isn't the answer to your question, but I think you can implement it
rather easily if you study the existing query elevation code.

Koji

On 2016/07/08 19:59, Swathika wrote:

A new requirement is to get a particular document as the second result on the result
page.

For example, if the query is “coal”, this document (id: 222) should come up as the
second result.

Please let me know if you have any solution.










Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Koji Sekiguchi

Hi,

An <analyzer> must have one and only one <tokenizer>, and
it can have zero or more <filter>s. From the point of view of these
rules, your first <analyzer> is not correct
because it has more than one <tokenizer>, and your second
<analyzer> is not correct as well because it has no <tokenizer>.
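A valid chain therefore looks like this (a generic sketch, not Rajesh's actual
configuration):

  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>

Exactly one <tokenizer>, followed by any number of <filter> elements, applied in order.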

Koji

On 2016/03/02 20:25, G, Rajesh wrote:

Hi Team,

Can you please clarify the below? My understanding is that a tokenizer is used to say how
the content should be indexed physically in the file system, and filters are used on query
results. The lines below are from my setup. But I have seen examples that include filters
inside … and a tokenizer in …, which confused me.

[analysis chain configuration stripped from the archived message]
My goal is to use Solr to find the best match among technology names, e.g.
actual tech names:

1. Microsoft Visual Studio

2. Microsoft Internet Explorer

3. Microsoft Visio

When a user types "Microsoft Visal Studio" the user should get "Microsoft Visual
Studio". Basically, misspelled and jumbled words should match the closest tech name.











Re: Help With Phrase Highlighting

2015-12-01 Thread Koji Sekiguchi

Hi Teague,

I couldn't understand the "document size" part of your question, but if you'd like
Solr to return the snippet

<em>My search phrase</em>

instead of

<em>My</em> <em>search</em> <em>phrase</em>

you should use FastVectorHighlighter. To use FVH, your highlight field (hl.fl=text)
needs to be indexed with the options termVectors=true, termPositions=true and
termOffsets=true.
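In schema.xml that means something like the following (the field name is taken from your
hl.fl; the type and the stored/indexed flags are assumptions):

  <field name="text" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

Then add hl.useFastVectorHighlighter=true to the query. Note that the documents have to
be reindexed after changing these options.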

Good luck!

Koji


On 2015/12/02 5:36, Teague James wrote:

Hello everyone,

I am having difficulty enabling phrase highlighting and am hoping someone
here can offer some help. This is what I have currently:

Solr 4.9
solrconfig.xml (partial snip)

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="wt">xml</str>
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">text</str>
      <str name="hl">on</str>
      <str name="hl.fl">text</str>
      <str name="hl.encoder">html</str>
      <int name="hl.fragsize">100</int>
    </lst>
  </requestHandler>

schema.xml (partial snip)



Query (partial snip):
...select?fq=id:43040&q="my%20search%20phrase"

Response (partial snip):
...

ipsum dolor sit amet, pro ne verear prompta, sea te aeterno scripta
assentior. (<em>my</em> <em>search</em>

<em>phrase</em> facilitates highlighting). Et option molestiae referrentur
ius. Viris quaeque legimus an pri


The document in which this phrase is found is very long. If I reduce the
document to a single sentence, such as "My search phrase facilitates
highlighting", then the response I get from Solr is:

<em>My</em> <em>search</em> <em>phrase</em> facilitates highlighting

What I am trying to achieve instead, regardless of the document size, is:

<em>My search phrase</em> facilitates highlighting

with a single indicator at the beginning and end, rather than three separate
words that may get distributed between two different snippets depending on the
placement of the snippet in the larger document.

I tried to follow this guide:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only/25970452#25970452
but got zero results. I suspect that this is due to the hl parameters in my
solrconfig file, but I cannot find any specific guidance on what the correct
parameters should be. I tried commenting out all of the hl parameters and also
got no results.

Can anyone offer any solutions for searching large documents and returning a
single phrase highlight?

-Teague






Re: Tokenize ShingleFilterFactory results and apply filters to tokens

2015-10-15 Thread Koji Sekiguchi

Hi Vitaly,

I'm not sure I understand you correctly; why don't you put EdgeNGramFilter just
after ShingleFilter? That is:
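(a minimal sketch of such a chain; attribute values are assumptions)

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>

The EdgeNGramFilter then sees each shingle as one token, so "Home Improvement" yields
prefixes such as "Home Improvem".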

Koji

On 2015/10/15 22:47, vitaly bulgakov wrote:

I want to rephrase the question I asked in another post.
As far as I understand, the ShingleFilterFactory filter creates shingles as
strings. But I want to apply more filters (like EdgeNGrams) to each token of a
shingle.

For example, from "Home Improvement Service" I get two shingles:
"Home Improvement" and "Improvement Service".

I want to apply EdgeNGram to be able to do exact matches on
"Hom Improvem" and "Improvemen Servi" as new phrases.

Any help or ideas are welcomed and appreciated.








Re: highlighting

2015-10-01 Thread Koji Sekiguchi

Hi Mark,

I think I saw a similar requirement recently on the mailing list. The feature sounds
reasonable to me.

> If not, how do I go about posting this as a feature request?

JIRA can be used for that purpose, but there is no guarantee that the feature will be
implemented. :(

Koji

On 2015/10/01 20:07, Mark Fenbers wrote:

Yeah, I thought about using markers, but then I'd have to search the text for
the markers to determine the locations. This is a clunky way of getting the
results I want; it would save two steps if Solr merely had an option to return
a start/length array (of what should be highlighted) in the original string
rather than returning an altered string with tags inserted.
Mark

On 9/29/2015 7:04 AM, Upayavira wrote:

You can change the strings that are inserted into the text, and could
place markers that you use to identify the start/end of highlighting
elements. Does that work?

Upayavira

On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:

Greetings!

I have highlighting turned on in my Solr searches, but what I get back
is <em> tags surrounding the found term. Since I use an SWT StyledText
widget to display my search results, what I really want is the offset
and length of each found term, so that I can highlight it in my own way
without HTML. Is there a way to configure Solr to do that? I couldn't
find it. If not, how do I go about posting this as a feature request?

Thanks,
Mark






Re: solr.SynonymFilterFactory

2015-09-17 Thread Koji Sekiguchi

Hi Vincenzo,

Intuitively, regardless of what values you set for attributes such as expand or
ignoreCase, I think synonym records where LHS == RHS are meaningless. That is, you can
remove those lines.

Koji


On 2015/09/17 16:51, Vincenzo D'Amore wrote:

Hello,

this may be a silly question.
I have found a synonyms file with a lot of cases where LHS is equal to RHS.

airmax=>airmax
airplane=>airplane
airwell=>airwell
akai=>akai
akasa=>akasa
akea=>akea
akg=>akg

Given that the solr.SynonymFilterFactory is configured with expand="false"
ignoreCase="true"

May I remove all these lines?

Bests,
Vincenzo







Re: How to export the list of terms indexed in Solr?

2015-04-29 Thread Koji Sekiguchi

Hi brent3600,

You can use NLP4L for this purpose. NLP4L is good at counting the number of words
not only in a whole index but also in a set of documents. There is a tutorial
for this function.

Count the number of words
http://nlp4l.github.io/tutorial_ja.html#useNLP

Sorry, but the tutorial is only in Japanese right now; we'll provide an English
tutorial soon. Until then, please use a translation service to read it. :)

Koji

On 2015/04/30 7:34, brent3600 wrote:

We are indexing collections of documents (files) with SOLR, and would like
the following capability:

Export or pull from SOLR the list of terms that have been indexed for a
document or set of documents, along with the term frequency count.
1.  Does SOLR already provide an API or method to accomplish this?
2.  If not, is there an add-on module that provides this functionality?
3.  If not, is it technically feasible at a low level of effort to add this
functionality?

- brent3600








Re: Sorting and Rerank

2015-03-25 Thread Koji Sekiguchi

Hi,

You're right. The result sets are the same; only the document order differs.

Koji


On 2015/03/26 0:53, innoculou wrote:

If I do an initial search without any field sorting, and then do the exact
same query but also sort on one field, will I get the same result set in the
subsequent query, just sorted? In other words, does simply applying a sort
criterion affect the re-ranking of the full search, or does it just sort the
results from the main query?








Re: Lucene cosine similarity score for more like this query

2015-02-03 Thread Koji Sekiguchi

Lucene uses the TFIDFSimilarity class to calculate similarity.
It is built on the idea of the cosine measure, but it modifies the cosine formula.
Please take a look at the Lucene Practical Scoring Function in the following
Javadoc:

http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

On 2015/02/03 5:39, Ali Nazemian wrote:

Dear Erik,
Thank you for your response. Would you please tell me why this score can
be higher than 1, while cosine similarity cannot be higher than 1?
On Feb 2, 2015 7:32 PM, Erik Hatcher erik.hatc...@gmail.com wrote:


The scoring is the same as Lucene.  To get deeper insight into how a score
is computed, use Solr’s debug=true mode to see the explain details in the
response.

 Erik


On Feb 2, 2015, at 10:49 AM, Ali Nazemian alinazem...@gmail.com wrote:

Hi,
I was wondering what the range of the score returned by a More Like This
query in Solr is. I know that Lucene uses cosine similarity in the vector
space model for calculating the similarity between two documents. I also know
that cosine similarity is between -1 and 1, but what I don't understand is why
the score returned by a More Like This query could be 12, for example. Would
you please explain what the calculation process in Solr is?
Thank you very much.

Best regards.

--
A.Nazemian











[ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
Hello,

It's my pleasure to share that I have an interesting tool word2vec for Lucene
available at https://github.com/kojisekig/word2vec-lucene .

As you can imagine, you can use word2vec for Lucene to extract word vectors 
from Lucene index.

Thank you,

Koji
-- 
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: [ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
Hi Paul,

I cannot compare it to SemanticVectors as I don't know SemanticVectors.
But word vectors that are produced by word2vec have interesting properties.

Here is the description of the original word2vec web site:

https://code.google.com/p/word2vec/#Interesting_properties_of_the_word_vectors
Interesting properties of the word vectors
It was recently shown that the word vectors capture many linguistic 
regularities, for example vector
operations vector('Paris') - vector('France') + vector('Italy') results in a 
vector that is very
close to vector('Rome'), and vector('king') - vector('man') + vector('woman') 
is close to
vector('queen')

Thanks,

Koji


(2014/11/20 20:01), Paul Libbrecht wrote:
 Hello Koji,
 
 how would you compare that to SemanticVectors?
 
 paul
 
 On 20 nov. 2014, at 10:10, Koji Sekiguchi k...@r.email.ne.jp wrote:
 
 Hello,

 It's my pleasure to share that I have an interesting tool word2vec for 
 Lucene
 available at https://github.com/kojisekig/word2vec-lucene .

 As you can imagine, you can use word2vec for Lucene to extract word 
 vectors from Lucene index.

 Thank you,

 Koji
 -- 
 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
 
 
 
 


-- 
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: [ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi
Thanks Glen for the URL. I'd like to check it when I have time.

Thanks Paul for giving me the difference between them. I like your description!

Koji

(2014/11/21 2:18), Paul Libbrecht wrote:
 As far as I could tell, word2vec seems more mathematical, which is rather 
 nice.
 At least I see more transparent math in the web-page.
 Maybe this helps a bit?
 
 SemanticVectors has always been rather pleasant for its LSI/LSA-like approach, but
 precisely this is mathematically opaque.
 Maybe it's more a question of presentation.
 
 Paul
 
 
 On 20 nov. 2014, at 16:24, Koji Sekiguchi k...@r.email.ne.jp wrote:
 
 Hi Paul,

 I cannot compare it to SemanticVectors as I don't know SemanticVectors.
 But word vectors that are produced by word2vec have interesting properties.

 Here is the description of the original word2vec web site:

 https://code.google.com/p/word2vec/#Interesting_properties_of_the_word_vectors
 Interesting properties of the word vectors
 It was recently shown that the word vectors capture many linguistic 
 regularities, for example vector
 operations vector('Paris') - vector('France') + vector('Italy') results in a 
 vector that is very
 close to vector('Rome'), and vector('king') - vector('man') + 
 vector('woman') is close to
 vector('queen')

 Thanks,

 Koji


 (2014/11/20 20:01), Paul Libbrecht wrote:
 Hello Koji,

 how would you compare that to SemanticVectors?

 paul

 On 20 nov. 2014, at 10:10, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Hello,

 It's my pleasure to share that I have an interesting tool word2vec for 
 Lucene
 available at https://github.com/kojisekig/word2vec-lucene .

 As you can imagine, you can use word2vec for Lucene to extract word 
 vectors from Lucene index.

 Thank you,

 Koji
 -- 
 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html






 -- 
 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


 
 
 
 


-- 
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: [ANN] word2vec for Lucene

2014-11-20 Thread Koji Sekiguchi

Hi Joseph,

Thank you for asking. If you want to do it in the interactive sense,
it won't work well in practice because the learning takes several minutes.

If you accept working in a batch sense, the feature could be implemented,
but I've not done it yet. I have an open ticket for that:

accept filter query
https://github.com/kojisekig/word2vec-lucene/issues/2

Thanks,

Koji

(2014/11/21 8:22), Joseph Obernberger wrote:

Hi Koji - is it possible to execute word2vec on a subset of documents from
Solr? I.e., could I run a query, get back the top n results, and pass only
those to word2vec?
Will this work with Solr Cloud?

Thank you!

-Joe

On Thu, Nov 20, 2014 at 12:18 PM, Paul Libbrecht p...@hoplahup.net wrote:


As far as I could tell, word2vec seems more mathematical, which is rather
nice.
At least I see more transparent math in the web-page.
Maybe this helps a bit?

SemanticVectors has always rather pleasant for the LSI/LSA-like approach,
but precisely this is mathematically opaque.
Maybe it's more a question of presentation.

Paul


On 20 nov. 2014, at 16:24, Koji Sekiguchi k...@r.email.ne.jp wrote:


Hi Paul,

I cannot compare it to SemanticVectors as I don't know SemanticVectors.
But word vectors that are produced by word2vec have interesting

properties.


Here is the description of the original word2vec web site:



https://code.google.com/p/word2vec/#Interesting_properties_of_the_word_vectors

Interesting properties of the word vectors
It was recently shown that the word vectors capture many linguistic

regularities, for example vector

operations vector('Paris') - vector('France') + vector('Italy') results

in a vector that is very

close to vector('Rome'), and vector('king') - vector('man') +

vector('woman') is close to

vector('queen')

Thanks,

Koji


(2014/11/20 20:01), Paul Libbrecht wrote:

Hello Koji,

how would you compare that to SemanticVectors?

paul

On 20 nov. 2014, at 10:10, Koji Sekiguchi k...@r.email.ne.jp wrote:


Hello,

It's my pleasure to share that I have an interesting tool word2vec

for Lucene

available at https://github.com/kojisekig/word2vec-lucene .

As you can imagine, you can use word2vec for Lucene to extract word

vectors from Lucene index.


Thank you,

Koji
--


http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html








--


http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html











--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: boosting words from specific list

2014-09-29 Thread Koji Sekiguchi

Hi Ali,

I don't think Solr has such a function out of the box. One way I can think of is
to implement an UpdateRequestProcessor. In the processAdd() method of
the UpdateRequestProcessor you can read the field values, so you can calculate
the total score and copy it to a field, e.g. total_score.
Then you can sort the query results on the total_score field when you query.
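Wiring-wise that would be a custom factory registered in solrconfig.xml, roughly like
this (the class name, chain name and the weights parameter are hypothetical):

  <updateRequestProcessorChain name="weighted-score">
    <processor class="com.example.WeightedScoreUpdateProcessorFactory">
      <!-- hypothetical resource listing the word weights, e.g. school=2, president=5 -->
      <str name="weights">weights.txt</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Its processAdd() would count the listed words in the document's field values, write the
weighted sum into total_score, and queries would then use sort=total_score desc.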

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/09/29 4:25), Ali Nazemian wrote:

Dear all,
Hi,
I was wondering how I can implement boosting in Solr for words from a specific list
of important words. I mean, I want to have a list of important words and
tell Solr to score documents based on the weighted sum of these words. For
example, let the word "school" have a weight of 2 and the word "president" a
weight of 5. In this case a doc with 2 "school" words and 3 "president"
words will have a total score of 19! I want to sort documents based on
this score. How is such a procedure possible in Solr? Thank you very much.
Best regards.







Re: statuscode list

2014-09-07 Thread Koji Sekiguchi

Hi Jan,

(2014/09/05 21:01), Jan Verweij - Reeleez wrote:

Hi,

If I'm correct you will get a statuscode=0 in the response if you
use XML messages for updating the solr index.


I think by statuscode=0 you mean status=0 here.

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">7</int></lst>
</response>


Is there a list of possible other statuscodes you can receive in case
anything fails and what these errorcodes mean?


I don't think we have a list of other possible status codes, because Solr
doesn't return a status other than 0. Instead of the status code in the XML,
you should look at the HTTP status code, e.g. 200 OK, 404 Not Found, etc.,
because if something goes wrong on Solr while updating (or even querying) the
index, Solr may not return XML at all.
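For example, with curl the -i flag shows the HTTP status line along with the response
(host, port and payload here are just illustrative):

  curl -i "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
       --data-binary "<add><doc><field name='id'>1</field></doc></add>"

A failed update then shows up as, e.g., HTTP/1.1 400 Bad Request, even when no status
element comes back in the body.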

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: ExternalFileFieldReloader and commit

2014-08-05 Thread Koji Sekiguchi

Hi Peter,

It seems like a bug to me, too. Please file a JIRA ticket if you can
so that someone can take it.

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/08/05 22:34), Peter Keegan wrote:

When there are multiple 'external file field' files available, Solr will
reload the last one (lexicographically) with a commit, but only if changes
were made to the index. Otherwise, it skips the reload and logs: No
uncommitted changes. Skipping IW.commit.  Has anyone else noticed this? It
seems like a bug to me. (yes, I do have firstSearcher and newSearcher event
listeners in solrconfig.xml)

Peter







Re: Understanding the Debug explanations for Query Result Scoring/Ranking

2014-07-24 Thread Koji Sekiguchi

Hi,

In addition, this might be useful:

Fundamentals of Information Retrieval, Illustration with Apache Lucene
https://www.youtube.com/watch?v=SCsS5ePGmCs

This video is about 40 minutes long, but you can fast-forward to 24:00
to learn about scoring based on the vector space model and how Lucene customizes it.
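For reference, the Lucene Practical Scoring Function from the TFIDFSimilarity Javadoc
(the page linked in the quoted mail below) can be written as:

  score(q,d) = coord(q,d) · queryNorm(q) ·
               Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

In the explain output quoted below, queryWeight is idf(t) · t.getBoost() · queryNorm(q)
(0.71447384 = 7.0424104 × 0.10145303, with boost 1) and fieldWeight is
tf · idf · fieldNorm (0.660226 = 1.0 × 7.0424104 × 0.09375); their product is each
term's contribution to the score.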

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/25 8:00), Uwe Reh wrote:

Hi,

to get an idea of the meaning of all these numbers, have a look at
http://explain.solr.pl. I like this tool, it's great.
this tool, it's great.

Uwe

Am 25.07.2014 00:45, schrieb O. Olson:

Hi,

If you add debug=true to the Solr request (and wt=xml if your
current output is not XML), you get a node in the resulting XML that
is named "debug". It has a child node called "explain" which
holds a list showing why the results are ranked in a particular order.
I'm curious whether there is some documentation on understanding these
numbers/results.

I am new to Solr, so I apologize if I am using the wrong terms to
describe my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it.

My problem is trying to understand something like this:

1.5797625 = (MATCH) sum of:
  0.4717142 = (MATCH) weight(text:televis in 44109) [DefaultSimilarity], result of:
    0.4717142 = score(doc=44109,freq=1.0 = termFreq=1.0), product of:
      0.71447384 = queryWeight, product of:
        7.0424104 = idf(docFreq=896, maxDocs=377553)
        0.10145303 = queryNorm
      0.660226 = fieldWeight in 44109, product of:
        1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
        7.0424104 = idf(docFreq=896, maxDocs=377553)
        0.09375 = fieldNorm(doc=44109)
  1.1080483 = (MATCH) weight(text:tv in 44109) [DefaultSimilarity], result of:
    1.1080483 = score(doc=44109,freq=6.0 = termFreq=6.0), product of:
      0.6996622 = queryWeight, product of:
        6.896415 = idf(docFreq=1037, maxDocs=377553)
        0.10145303 = queryNorm
      1.5836904 = fieldWeight in 44109, product of:
        2.4494898 = tf(freq=6.0), with freq of: 6.0 = termFreq=6.0
        6.896415 = idf(docFreq=1037, maxDocs=377553)
        0.09375 = fieldNorm(doc=44109)

*Note:* I have searched for "televisions". My search field is a single
catch-all field. The edismax parser seems to break up my search term into
"televis" and "tv".

Is there documentation on how to understand these numbers? They do not
seem to be properly delimited. At a minimum, I can understand something
like:
1.5797625 = 0.4717142 + 1.1080483
and
0.71447384 = 7.0424104 * 0.10145303

But I cannot tell whether something like "0.10145303 = queryNorm" or "0.660226
= fieldWeight in 44109" is used in the calculation anywhere. Also, since
there were only two terms (televis and tv), I could use subtraction to
find out that 1.1080483 was the start of a new result.

I'd also appreciate it if someone could tell me which class dumps out the above
data. If I know it, I can edit that class to make the output a bit more
understandable for me.

Thank you,
O. O.
















Re: Contiguous Phrase Highlighting Example

2014-07-17 Thread Koji Sekiguchi

Hi Teague,

If you want phrase-unit tagging from the highlighter, you need to use
FastVectorHighlighter instead of the ordinary Highlighter.

To turn on FVH, set hl.useFastVectorHighlighter=on when querying.
In addition, when indexing, you need to set termVectors=on, termPositions=on
and termOffsets=on on the content field in your schema.xml.

http://wiki.apache.org/solr/HighlightingParameters#hl.useFastVectorHighlighter
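Put together, a request sketch using the id and field from the original mail:

  http://localhost/solr/collection1/select?q=%22knowledge+of+science%22&fq=id:100&hl=on&hl.fl=content&hl.useFastVectorHighlighter=true

assuming the content field has been reindexed with the termVectors, termPositions and
termOffsets options turned on.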

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/18 3:19), Teague James wrote:

Hi everyone!

Does anyone have any good examples of generating a contiguous highlight for
a phrase? Here's what I have done:

curl http://localhost/solr/collection1/update?commit=true -H "Content-Type:
text/xml" --data-binary '<add><doc><field name="id">100</field><field
name="content">blah blah blah knowledge of science blah blah
blah</field></doc></add>'

Then, using a browser:

http://localhost/solr/collection1/select?q=knowledge+of+science&fq=id:100

What I get back in highlighting is:
<str>blah blah blah <b>knowledge</b> <b>of</b> <b>science</b> blah blah
blah</str>

What I want to get back is:
<str>blah blah blah <b>knowledge of science</b> blah blah blah</str>

I have the following highlighting configurations in my requestHandler in
addition to hl, hl.fl, etc.:
<str name="hl.mergeContiguous">false</str>
<str name="usePhraseHighlighter">true</str>
<str name="highlightMultiTerm">true</str>
Neither of the last two seemed to have any impact on the output. I've tried
every permutation of those three, but the output is the same. Any
suggestions or examples of getting highlights to come back this way? I'd
appreciate any advice on this! Thanks!

-Teague










Re: OCR - Saving multi-term position

2014-07-02 Thread Koji Sekiguchi

Hi Manuel,

I think OCR error correction is one of the well-known NLP tasks.
I'd thought in the past that it could be implemented by using Lucene.

This is a brief idea:

1. You have got a Lucene index. This existing index is made from
correct (i.e. error-free) documents from the same domain as the OCR documents.

2. Tokenize the OCR text with ShingleTokenizer. From it, you'll get:

the quiok
tlne quick
the quick
:

3. Search for those phrases in the existing index. I think exact search
(PhraseQuery) or FuzzyQuery would work. You should get the highest hit
count when searching "the quick" among those phrases.

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/02 7:19), Manuel Le Normand wrote:

Hello,
Many of our indexed documents are scanned and OCR'ed.
Unfortunately we were not able to improve the OCR quality much (less than
80% word accuracy) for various reasons, a fact which badly hurts
retrieval quality.

As we use an open-source OCR, we are thinking of expanding every scanned term
output to its main possible variations to get a higher level of confidence.

Is there any analyzer that supports this kind of need, or should I make up a
syntax and analyzer of my own, i.e. the payload syntax?

The quick brown fox -> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4

Thanks,
Manuel







Re: Restriction on type of uniqueKey field?

2014-07-01 Thread Koji Sekiguchi

In addition, KeywordTokenizer can seemingly be used, but it should be avoided
for the unique key field. One of my customers used it and they got an OOM
during long-term indexing. As it was difficult to track the problem down,
I'd like to share my experience.

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/01 6:48), Alexandre Rafalovitch wrote:

I wasn't thinking of shard keys, but may have been confused in the reading.

Thank you everyone, the long key is working just fine for me.

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Jul 1, 2014 at 8:15 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:

Alex, maybe you're thinking of constraints put on shard keys?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Jul 1, 2014 at 7:05 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


No, you definitely can have an int or long uniqueKey. A lot of Solr's tests
use such a uniqueKey. See
solr/core/src/test-files/solr/collection1/conf/schema.xml


On Tue, Jul 1, 2014 at 3:20 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:


Hello,

I remember reading somewhere that id field (uniqueKey) must be String.
But I cannot find the definitive confirmation, just that it should be
non-analyzed.

Can I use a single-valued TrieLongField type, with precision set to 0?
Or am I going to hit issues?

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency





--
Regards,
Shalin Shekhar Mangar.









Re: Multiple highlight snippet for single field

2014-05-16 Thread Koji Sekiguchi

Hi Bijan,

Have you tried setting the hl.maxAnalyzedChars parameter to a larger number?

hl.maxAnalyzedChars
http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars

As the default value of the parameter is 51200, if the second "Andy" is
near the end of your large stored field, the highlighter doesn't
deal with it.
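For example (the value is arbitrary, just larger than the 51200 default):

  ...&hl=true&hl.fl=yourfield&hl.maxAnalyzedChars=1000000

or as a request handler default: <int name="hl.maxAnalyzedChars">1000000</int>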

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/05/16 13:25), Bijan Pourriahi wrote:

Hello all,

I am trying to return multiple snippets from a single document with a field 
which includes many (5+) instances of the word ‘andy’ in the text. For some 
reason, I can only get it to return one snippet. Any ideas?

Here’s the query and the response:
http://codejaw.com/2gwoozr

Thanks!

- Bijan








Re: AND not as a boolean operator in Phrase

2014-03-25 Thread Koji Sekiguchi

(2014/03/26 2:29), abhishek jain wrote:

hi friends,

when I search for "A and B" it gives me results for A, B and I am not sure
why.

Please guide me on how I can get an exact match when it is within a phrase/quotes.


Generally speaking (with LuceneQParser), if you want phrase-match results,
use quotes, i.e. q="A B". If you want results which contain both terms A
and B, do not use quotes but the boolean operator AND, i.e. q=A AND B.

koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: Solr Nutch

2014-01-28 Thread Koji Sekiguchi

1. Nutch follows the links within HTML web pages to crawl the full graph of a 
web of pages.


In addition, I think Nutch has a PageRank-like scoring function, as opposed to
Lucene/Solr, which are based on vector space model scoring.

koji
--
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html


Re: document contained more than 100000 characters

2013-12-25 Thread Koji Sekiguchi

Hi,

I'm not sure, but you probably hit a Tika exception.
Have you checked the Apache Tika mailing list?

Hmm, just now I googled "Your document contained more than 100000 characters"
and found a page on StackOverflow. According to it, there is an API to change
the limit, but I don't know whether Solr can change the limit.
If there is no way to change the limit in Solr, you can open a JIRA ticket.

koji
--
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html

(13/12/23 2:17), Nutan wrote:

Why is the error:
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your
document contained more than 100000 characters, and so your requested limit
has been reached. To receive the full text of the document, increase your
limit. (Text up to the limit is however available.)
at
org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:140)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)


when I added this in solrconfig.xml:

<requestDispatcher handleSelect="false">
  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="200048" />
</requestDispatcher>









Re: indexing from bowser

2013-12-16 Thread Koji Sekiguchi

Hi,

(13/12/16 19:46), Nutan wrote:

how to index pdf,doc files from browser?


I think you can index from a browser.

If you said that


this query is used for indexing :
curl
"http://localhost:8080/solr/document/update/extract?literal.id=12&commit=true"
-Fmyfile=@C:\solr\document\src\test1\Coding.pdf


curl works for you but


When i try to index using this:
http://localhost:8080/solr/document/update/extract?literal.id=12&commit=true
-Fmyfile=@C:\solr\document\src\test1\Coding.pdf

the document does not get indexed.


the browser doesn't work for you, so why don't you look into the Solr log and
compare the logs between using curl and using the browser?
koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Passing a Parameter to a Custom Processor

2013-12-13 Thread Koji Sekiguchi

Hi Dileepa,


The stanbolInterceptor processor chain will be used in multiple request
handlers. Then I will have to pass the stanbol.enhancer.url param in each
of those request handler which will cause redundant configurations.
Therefore I need to pass the param to the processor directly.

But when I pass the params to the Processor as below the parameter is not
received to my ProcessorFactory class;
<processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory">
  <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
</processor>

Can someone point out what might be wrong here? Can someone please advice
on how to pass parameters directly to the Processor?


I don't know why your Processor cannot get the parameters, but a Processor should
be able to get them. For example, StatelessScriptUpdateProcessorFactory can get the
script parameter like this:

<processor class="solr.StatelessScriptUpdateProcessorFactory">
  <str name="script">updateProcessor.js</str>
</processor>

http://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

So why don't you consult the source code of 
StatelessScriptUpdateProcessorFactory, etc?

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: SOLRJ API to do similar CURL command execution

2013-11-13 Thread Koji Sekiguchi

(13/11/13 22:25), Anupam Bhattacharya wrote:

How can I post the whole XML string to SOLR using its SOLRJ API ?




The source code of SimplePostTool would be of some help:

http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/util/SimplePostTool.html
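If you prefer not to read through SimplePostTool, here is a minimal SolrJ sketch
that posts a whole XML string; the core URL and document contents are illustrative:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.DirectXmlRequest;

public class PostWholeXml {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    String xml = "<add><doc><field name=\"id\">1</field>"
               + "<field name=\"title\">hello</field></doc></add>";
    // DirectXmlRequest posts the raw XML body to the given handler path
    server.request(new DirectXmlRequest("/update", xml));
    server.commit();
  }
}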

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: count links pointing to id

2013-11-10 Thread Koji Sekiguchi

(13/11/10 3:43), Andreas Owen wrote:

I have a multivalue field with links pointing to ids of Solr documents. I
would like to calculate how many links are pointing to each document and put
that number into the field links2me. How can I do this? I would prefer to do
it with a query and the updater so Solr can do it internally if possible.


I don't think Solr can do it internally. You should sum up the link counts
per id and put the sum into the links2me field before indexing.
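A minimal pre-aggregation sketch; the field name links2me is the one from the
question, while the in-memory link map is illustrative:

import java.util.*;

public class LinkCounter {
  public static void main(String[] args) {
    // doc id -> ids of the documents it links to
    Map<String, List<String>> links = new HashMap<String, List<String>>();
    links.put("a", Arrays.asList("b", "c"));
    links.put("b", Arrays.asList("c"));
    links.put("c", Collections.<String>emptyList());

    // count inbound links per id
    Map<String, Integer> links2me = new HashMap<String, Integer>();
    for (List<String> targets : links.values()) {
      for (String t : targets) {
        Integer c = links2me.get(t);
        links2me.put(t, c == null ? 1 : c + 1);
      }
    }
    // set this count on each document's links2me field before sending it to Solr
    System.out.println(links2me); // {b=1, c=2}
  }
}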

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: solr sort facets by name

2013-11-05 Thread Koji Sekiguchi

(13/11/06 9:00), PeterKerk wrote:

By default solr sorts facets by the amount of hits for each result. However,
I want to sort by facetnames alphabetically. Earlier I sorted the facets on
the client or via my .NET code, however, this time I need solr to return the
results with alphabetically sorted facets directly.
How?


Isn't it facet.sort=index ?

http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Unable to add mahout classifier

2013-10-31 Thread Koji Sekiguchi

Caused by: java.lang.ClassCastException: class 
com.mahout.solr.classifier.CategorizeDocumentFactory
 at java.lang.Class.asSubclass(Unknown Source)
 at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:433)
 at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:381)
 at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:526)
 ... 21 more


There seems to be a problem related to class loaders, e.g.
CategorizeDocumentFactory, which extends UpdateRequestProcessorFactory, was loaded
by class loader B, but Solr core has loaded UpdateRequestProcessorFactory via
class loader A, or something like that...

koji
--
http://www.rondhuit.com/


Re: Unable to add mahout classifier

2013-10-30 Thread Koji Sekiguchi

(13/10/30 22:09), lovely kasi wrote:

Hi,

I made a few changes to the solrconfig.xml, created a jar file, added it to
the lib folder of Solr and tried to start it.

The changes in the solrconfig.xml are:

<updateRequestProcessorChain name="mahoutclassifier" default="true">
  <processor class="com.mahout.solr.classifier.CategorizeDocumentFactory">
    <str name="inputField">LEAD_NOTES</str>
    <str name="outputField">category</str>
    <str name="defaultCategory">Others</str>
    <str name="model">naiveBayesModel</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>


What is com.mahout.solr.classifier.CategorizeDocumentFactory?
Is it a classifier delivered by the Solr community?

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Return the synonyms as part of Solr response

2013-10-30 Thread Koji Sekiguchi

Hi Siva,

(13/10/30 18:12), sivaprasad wrote:

Hi,
We have a requirement where we need to send the matched synonyms as part of
Solr response.


I don't think that Solr has such function.


Do we need to customize the Solr response handler to do this?


So the answer is yes.

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Help on solr more like this functionality

2013-10-26 Thread Koji Sekiguchi

Hi Suren,

(13/10/25 23:36), Suren Raju wrote:

Hi,

We are trying to solve a business problem by performing solr more like this
query. We are able to perform the more like this search. We have a specific
use case that requires different boosts on different match fields. Say I do
more like this based on the fields title and description of products. I want to
provide more boost for the match field title than for description.

The query I'm trying so far is

mysolrhost:8983/solr/mlt?q=id:UTF8TEST&mlt.fl=title,description&mlt.mindf=1&mlt.mintf=1

Is there any way to provide different boost for title and description?



I don't have much experience on MLT, but index time boosting might help you?

Koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: how to debug my own analyzer in solr

2013-10-21 Thread Koji Sekiguchi

Hi Mingz,

If you use Eclipse, you can debug Solr with your plugin like this:

# go to Solr install directory
$ cd $SOLR
$ ant run-example -Dexample.debug=true

Then connect the JVM from Eclipse via remote debug port 5005.

Good luck!

koji


(13/10/21 18:58), Mingzhu Gao wrote:

More information about this: the custom analyzer just implements
createComponents of Analyzer.

And my configure in schema.xml is just something like :

<fieldType name="text_cn" class="solr.TextField">
  <analyzer class="my.package.CustomAnalyzer" />
</fieldType>



From the log I cannot see any error information; however, when I want to
analyze or add document data, it always hangs there.

Any way to debug or narrow down the problem ?

Thanks in advance .

-Mingz

On 10/21/13 4:35 PM, Mingzhu Gao m...@adobe.com wrote:


Dear solr expert ,

I would like to write my own analyser ( Chinese analyser ) and integrate
them into solr as solr plugin .

From the log information, the custom analyzer can be loaded into Solr
successfully. I define my fieldType with this custom analyzer.

Now the problem is that, when I try this analyzer from
http://localhost:8983/solr/#/collection1/analysis (click the analysis tab,
then choose my FieldType, then input some text),
after I click the "Analyse Value" button, Solr hangs there and I cannot get
any result or response in a few minutes.

I also tried to add some data by
curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml",
or by post.sh in the exampledocs folder.
The same issue: Solr hangs there, no result and no response.

Can anybody give me some suggestions on how to debug solr to work with my
own custom analyzer ?

By the way , I write a java program to call my custom analyzer , the
result is okay , for example , the following code can work well .
==
Analyzer analyzer = new MyAnalyzer();

// tokenStream() needs a field name and a Reader over the text to analyze
TokenStream ts = analyzer.tokenStream("content", new StringReader("some text"));

CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);

ts.reset();

while (ts.incrementToken()) {
    System.out.println(ta.toString());
}

ts.end();
ts.close();

=


Thanks,

-Mingz







--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: ExtractRequestHandler, skipping errors

2013-10-18 Thread Koji Sekiguchi

Hi,

I think the flag cannot ignore NoSuchMethodError. There may be something wrong 
here?

... I've just checked my Solr 4.5 directories and I found Tika version is 1.4.

Tika 1.4 seems to use commons compress 1.5:

http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup

But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/ directory.

Can you open a JIRA issue?

For now, you can get commons compress 1.5 and put it into the directory
(don't forget to remove the 1.4.1 jar file).

koji

(13/10/18 16:37), Roland Everaert wrote:

Hi,

We already configure the ExtractRequestHandler to ignore Tika exceptions,
but it is Solr that complains. The customer managed to reproduce the
problem. Following is the error from the solr.log. The file type that caused this
exception was WMZ. It seems that something is missing in a Solr class. We
use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException;
null:java.lang.RuntimeException: java.lang.NoSuchMethodError:
org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
 at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
 at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
 at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
 at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
 at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
 at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
 at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
 at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
 at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
 at
org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoSuchMethodError:
org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
 at
org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
 at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
 at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
 at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
 ... 16 more





On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:


Hi Roland,


(13/10/17 20:44), Roland Everaert wrote:


Hi,

I helped a customer to deploy Solr+ManifoldCF and everything is going
quite smoothly, but every time Solr raises an exception, the
ManifoldCF job feeding Solr aborts. I would like to know if it is possible
to configure the ExtractRequestHandler to ignore errors, like it seems to be
possible with the DataImportHandler and entity processors.

I know that it is possible to configure the ExtractRequestHandler to
ignore Tika exceptions (we already do that), but the errors that now stop the
MCF jobs are generated by Solr itself.

While it is interesting to have such option in solr, I plan to post to the
manifoldcf mailing list, anyway, to know if it is possible to configure
manifolcf to be less picky about solr errors.



ignoreTikaException flag might help you?

https://issues.apache.org/jira/browse/SOLR-2480

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

Re: ExtractRequestHandler, skipping errors

2013-10-17 Thread Koji Sekiguchi

Hi Roland,

(13/10/17 20:44), Roland Everaert wrote:

Hi,

I helped a customer to deploy Solr+ManifoldCF and everything is going
quite smoothly, but every time Solr raises an exception, the
ManifoldCF job feeding Solr aborts. I would like to know if it is possible
to configure the ExtractRequestHandler to ignore errors, like it seems to be
possible with the DataImportHandler and entity processors.

I know that it is possible to configure the ExtractRequestHandler to ignore
Tika exceptions (we already do that), but the errors that now stop the
MCF jobs are generated by Solr itself.

While it is interesting to have such option in solr, I plan to post to the
manifoldcf mailing list, anyway, to know if it is possible to configure
manifolcf to be less picky about solr errors.



ignoreTikaException flag might help you?

https://issues.apache.org/jira/browse/SOLR-2480

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: req info : SOLRJ and TermVector

2013-10-16 Thread Koji Sekiguchi

(13/10/16 17:47), elfu wrote:

Hi,

can I access TermVector information using SolrJ?


There is TermVectorComponent to get termVector info:

http://wiki.apache.org/solr/TermVectorComponent

So yes, you can access it using solrj.
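SolrJ has no typed accessor for term vectors, so you read the raw NamedList; a
minimal sketch, assuming a /tvrh request handler configured with the
TermVectorComponent (the core URL is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class TermVectorFetch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("id:1");
    q.setRequestHandler("/tvrh"); // handler with TermVectorComponent in its components
    q.set("tv.all", true);
    QueryResponse rsp = server.query(q);
    // the component puts its output under the "termVectors" key of the response
    NamedList<?> tv = (NamedList<?>) rsp.getResponse().get("termVectors");
    System.out.println(tv);
  }
}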

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: fq caching question

2013-10-14 Thread Koji Sekiguchi

Hi Tim,

(13/10/15 5:22), Tim Vaillancourt wrote:

Hey guys,

Sorry for such a simple question, but I am curious as to the differences in 
caching between a
combined filter query, and many separate filter queries.

Here are 2 example queries, one with combined fq, one separate:

1) /select?q=*:*&fq=type:bid&fq=user_id:3
2) /select?q=*:*&fq=(type:bid%20AND%20user_id:3)

For query #1: am I correct that the first query will keep 2 independent entries
in the filterCache, for type:bid and user_id:3?


Correct.


For query #2: is it correct that the 2nd query will keep 1 entry in the 
filterCache that satisfies
all conditions?


Correct.


Lastly, is it a fair statement that under general query patterns, many separate
filter queries are more cacheable than 1 combined one? E.g., if I performed query #2
(in the filterCache) and then changed the user_id, nothing about my new query is
cacheable, correct (but if I used 2 separate filter queries then 1 of the 2 is
still cached)?


Yes, it is.

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Please help!, Highlighting exact phrases with solr

2013-10-10 Thread Koji Sekiguchi

(13/10/10 18:17), Silvia Suárez wrote:

I am using SolrJ as a client for indexing documents on the Solr server. I am
new to Solr, and I am having a problem with highlighting in Solr.
Highlighting exact phrases with Solr does not work.

For example, if the search keyword is "dulce hogar", it returns:

<span class="item">dulce</span> <span class="item">hogar</span>

And it should be:

<span class="item">dulce hogar</span>

I don't understand what the problem is. Can someone help me please!?


Unfortunately, that is how the default highlighter behaves.
FVH supports phrase-unit highlighting.

http://wiki.apache.org/solr/HighlightingParameters#hl.useFastVectorHighlighter
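Note that FVH requires term vectors on the highlighted field; a minimal sketch of
what that looks like (the field and type names here are illustrative):

<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

Then, after reindexing, query with hl=true&hl.useFastVectorHighlighter=true.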

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: defType

2013-08-10 Thread Koji Sekiguchi

See lines 33 to 50 at
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/QParserPlugin.java?view=markup

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

(13/08/11 8:05), William Bell wrote:

Can you list them out?

Thanks.

raw
lucene
dismax
edismax
field




On Sat, Aug 10, 2013 at 4:45 PM, Jack Krupansky j...@basetechnology.comwrote:


The full list is in my book. What did you need in particular?

(Actually, I forgot to add maxscore to my list.)

-- Jack Krupansky

-Original Message- From: William Bell Sent: Saturday, August 10,
2013 6:30 PM To: solr-user@lucene.apache.org Subject: defType
What are the possible options for defType?

lucene
dismax
edismax

Others?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076











Re: Proximity and highlighting

2013-08-03 Thread Koji Sekiguchi

(13/08/04 14:36), Alex Cougarman wrote:

Hi all. I'm having some issues with highlighting and proximity searching in 
Solr 4.x. Matching words in the query are sometimes highlighted even if they 
are not within proximity and in some cases, matching words in the query are not 
highlighted at all. Does anyone know why this would be happening? Thanks.

-Alex



Do you set hl.usePhraseHighlighter parameter to true?

http://wiki.apache.org/solr/HighlightingParameters#hl.usePhraseHighlighter

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: ICUTransformFilterFactory

2013-08-02 Thread Koji Sekiguchi

(13/08/02 17:53), Jochen Lienhard wrote:

Hello,

we have a problem with some special characters: for example æ


We are using the ICUTransformFilterFactory for indexing and searching.

We have some documents with "urianae" and with "urianæ".

If I search "urianae" I find only the versions with "urianae" but not
"urianæ".
Only if I search "urianae*" do I find both versions.

Is it possible (perhaps by special IDs in the ICUTransformFilterFactory) to
find all
without an asterisk?


Why don't you use MappingCharFilter?

https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
(attached at https://issues.apache.org/jira/browse/SOLR-822 )
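A minimal sketch of what that could look like (the mapping file name is
illustrative); the charFilter runs before the tokenizer in both the index and
query analyzers:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>

with mapping-chars.txt containing, e.g.:

"æ" => "ae"
"Æ" => "AE"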

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Sort by document similarity counts

2013-07-18 Thread Koji Sekiguchi

I have tried doing this via custom SearchComponent, where I can find all similar 
documents for each document in current search result, then add a new field into 
document hoping to use sort parameter (q=*sort=similarityCount).


I don't understand this part very well, but:


But this will not work because sort is done before handling my custom search 
component, if added via last-components. Can't add it via first-components, 
because then I will have no access to query results. And I do not want to 
override QueryComponent because I need to have all the functionality it covers: 
grouping, facets, etc.


You may want to put your custom SearchComponent into last-components and inject a
SortSpec
in your prepare() so that QueryComponent can sort the result complying with
your SortSpec?

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Find related words

2013-07-04 Thread Koji Sekiguchi

You may want collocations for a given word? I've implemented LUCENE-474 for Solr
a while ago and I found it worked pretty well.

https://issues.apache.org/jira/browse/LUCENE-474

Hope this helps.

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

(13/07/04 21:09), Dotan Cohen wrote:

How might one find the top related words for a given word in a Solr index?

For instance, given the following single-field documents:
1: I love chocolate
2: I love Solr
3: I eat chocolate cake
4: You will eat chocolate candy

Thus, given the word "chocolate", Solr might find these top words:
I (3 times matched)
eat (2 times matched)
love, cake, you, will, candy (1 time each)

Thanks!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com







Re: Find related words

2013-07-04 Thread Koji Sekiguchi

Hi Dotan,

(13/07/04 23:51), Dotan Cohen wrote:

Thank you Jack and Koji. I will take a look at MLT and also at the
.zip files from LUCENE-474. Koji, did you have to modify the code for
the latest Solr?


Yes. As the Lucene APIs for accessing index have been changed,
I had to modify the code.

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-28 Thread Koji Sekiguchi

Hi Rajesh,

Thanks!
I'm planning to open an NLP tool kit for Lucene, and the tool kit will include
the following synonym library.

koji

(13/05/28 14:12), Rajesh Nikam wrote:

Hello Koji,

This is seems pretty useful post on how to create synonyms file.
Thanks a lot for sharing this !

Have you shared source code / jar for the same so at it could be used ?

Thanks,
Rajesh



On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:


Hello,

Sorry for cross post. I just wanted to announce that I've written a blog
post on
how to create synonyms.txt file automatically from Wikipedia:


http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

Hope that the article gives someone a good experience!

koji
--

http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html






--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Note on The Book

2013-05-27 Thread Koji Sekiguchi

Hi Jack,

I'd like to ask as a person who contributed a case study article about
Automatically acquiring synonym knowledge from Wikipedia to the book.

(13/05/24 8:14), Jack Krupansky wrote:

To those of you who may have heard about the Lucene/Solr book that I and two 
others are writing on Lucene and Solr, some bad and good news. The bad news: 
The book contract with O'Reilly has been canceled. The good news: I'm going to 
proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat 
reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of 
the previous effort was too great, even for O'Reilly: a book larger than 800 
pages (or even 600) that was heavy on reference and lighter on "guide" just 
wasn't fitting in with their traditional "guide" model. In truth, Solr is just 
too complex for a simple guide that covers it all, let alone Lucene as well.


Will the reduced Solr-only reference guide include my article?
If not (for now I think it is not because my article is for Lucene case study,
not Solr), I'd like to put it out on my blog or somewhere.

BTW, those who want to know how to acquire synonym knowledge from Wikipedia,
the summary is available at slideshare:

http://www.slideshare.net/KojiSekiguchi/wikipediasolr

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


[blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-27 Thread Koji Sekiguchi
Hello,

Sorry for cross post. I just wanted to announce that I've written a blog post on
how to create synonyms.txt file automatically from Wikipedia:

http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

Hope that the article gives someone a good experience!

koji
-- 
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Note on The Book

2013-05-27 Thread Koji Sekiguchi

Now my contribution can be read on soleami blog in English:

Automatically Acquiring Synonym Knowledge from Wikipedia
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

koji

(13/05/27 21:16), Jack Krupansky wrote:

If you would like to Solr-ize your contribution, that would be great. The focus 
of the book will be
hard-core Solr.

-- Jack Krupansky

-Original Message- From: Koji Sekiguchi
Sent: Monday, May 27, 2013 8:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Note on The Book

Hi Jack,

I'd like to ask as a person who contributed a case study article about
Automatically acquiring synonym knowledge from Wikipedia to the book.

(13/05/24 8:14), Jack Krupansky wrote:

To those of you who may have heard about the Lucene/Solr book that I and two 
others are writing on
Lucene and Solr, some bad and good news. The bad news: The book contract with 
O'Reilly has been
canceled. The good news: I'm going to proceed with self-publishing (possibly on 
Lulu or even
Amazon) a somewhat reduced scope Solr-only Reference Guide (with hints of 
Lucene). The scope of
the previous effort was too great, even for O'Reilly: a book larger than 800 
pages (or even 600)
that was heavy on reference and lighter on "guide" just wasn't fitting in with 
their traditional
"guide" model. In truth, Solr is just too complex for a simple guide that 
covers it all, let alone
Lucene as well.


Will the reduced Solr-only reference guide include my article?
If not (for now I think it is not because my article is for Lucene case study,
not Solr), I'd like to put it out on my blog or somewhere.

BTW, those who want to know how to acquire synonym knowledge from Wikipedia,
the summary is available at slideshare:

http://www.slideshare.net/KojiSekiguchi/wikipediasolr

koji



--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: cache disable through solrJ

2013-05-20 Thread Koji Sekiguchi

(13/05/20 20:53), J Mohamed Zahoor wrote:

Hi

How do I disable the cache (Solr fieldValueCache) for certain queries...
using HTTP it can be done using {!cache=false}...

how can I do it from SolrJ?

./zahoor



How about using facet.method=enum?
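If the goal is just to pass {!cache=false} from SolrJ, no special API is needed;
local params travel inside the parameter value. A minimal sketch (the core URL and
field names are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class NoCacheQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("{!cache=false}category:books"); // fq that skips the filterCache
    q.setFacet(true);
    q.addFacetField("category");
    q.set("facet.method", "enum"); // avoids the fieldValueCache, as suggested above
    System.out.println(server.query(q).getResults().getNumFound());
  }
}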

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Solr 3.6.1: changing a field from stored to not stored

2013-04-23 Thread Koji Sekiguchi

(13/04/24 7:09), Petersen, Robert wrote:

Hi guys,

What would happen if I changed a field definition on an existing field in an 
existing index from stored to not stored?  Would solr just party on ignoring 
the fact that this field's data is stored in the current index?  I noticed I am 
unnecessarily storing some fields in my index and I'd like to stop storing them 
without having to 'reindex the world' and let the changes just naturally 
percolate into my index as updates come in the normal course of things.  Do you 
guys think I could get away with this?

Thanks,

Robert (Robi) Petersen
Senior Software Engineer
 Search Engineer



I think Solr will just ignore the existing stored data.
But I've never done it myself. Please try it.

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Returning similarity values for more like this search

2013-04-19 Thread Koji Sekiguchi

(13/04/19 23:24), Achim Domma wrote:

Hi,

I'm executing a search including a search for similar documents
(mlt=true&mlt.fl=...), which works fine so far. I would like to get the
similarity value for each document. I expected this to be quite common and
simple, but I could not find a hint on how to do it. Any hint would be very
appreciated.

kind regards,
Achim



Using debugQuery=true, you can find explanations in the debug section of the 
response.

See:
https://issues.apache.org/jira/browse/SOLR-860

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: conditional queries?

2013-04-09 Thread Koji Sekiguchi

Hi Mark,

 Is it possible to do a conditional query if another query has no results?  For example, say I 
want to search against a given field for:


- Search for car.  If there are results, return them.
- Else, search for car* .  If there are results, return them.
- Else, search for car~ .  If there are results, return them.

Is this possible in one query?  Or would I need to make 3 separate queries by 
implementing this logic within my client?


As far as I know, there is no such SearchComponent.
But the idea of FallbackRequestHandler has been told, see SOLR-1878, for 
example:

https://issues.apache.org/jira/browse/SOLR-1878
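In the meantime, the three-query fallback is straightforward to implement
client-side; a minimal SolrJ sketch (the core URL is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FallbackSearch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    // try each variant in order and keep the first one that returns results
    String[] variants = { "car", "car*", "car~" };
    for (String v : variants) {
      QueryResponse rsp = server.query(new SolrQuery(v));
      if (rsp.getResults().getNumFound() > 0) {
        System.out.println(v + " matched " + rsp.getResults().getNumFound() + " docs");
        break;
      }
    }
  }
}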

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Flow Chart of Solr

2013-04-02 Thread Koji Sekiguchi

(13/04/02 21:45), Furkan KAMACI wrote:

Is there any documentation, something like a flow chart of Solr? I.e.,
documents come into Solr (maybe indicating which classes receive documents) and
go through the parsing process (i.e. stemming etc.), and then inverted
indexes are built, and so on?



There is an interesting ticket:

Architecture Diagrams needed for Lucene, Solr and Nutch
https://issues.apache.org/jira/browse/LUCENE-2412

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Confusion over Solr highlight hl.q parameter

2013-04-02 Thread Koji Sekiguchi
(13/04/03 5:27), Van Tassell, Kristian wrote:
 Thanks Koji, this helped with some of our problems, but it is still not 
 perfect.
 
 This query, for example, returns no highlighting:
 
?q=id:abc123&hl.q=text_it_IT:l'assieme&hl.fl=text_it_IT&hl=true&defType=edismax
 
 But this one does (when it is, in effect, the same query):
 
?q=text_it_IT:l'assieme&hl=true&defType=edismax&hl.fl=text_it_IT
 
 I've tried many combinations but can't seem to get the right one to work. Is 
 this possibly a bug?

As hl.q doesn't honor the defType parameter but does honor local params,
can you try putting {!edismax} into the hl.q parameter?
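For example, an illustrative variant of the failing query above:

?q=id:abc123&hl.q={!edismax}text_it_IT:l'assieme&hl.fl=text_it_IT&hl=true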

koji
-- 
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Getting back highlights almost always works...

2013-03-19 Thread Koji Sekiguchi

(13/03/20 6:14), Van Tassell, Kristian wrote:

...but I'm finding some examples where the stored text is so big (14,000 words)
that Solr fails to highlight anything. But the data is definitely in the text
field and is returned due to that hit.

Does anyone have any ideas why this happens?



Probably you are missing hl.maxAnalyzedChars parameter?

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
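The default is 51200 characters, which a 14,000-word field easily exceeds;
raising it, e.g. hl.maxAnalyzedChars=1000000, should let the highlighter see
the whole field.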

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Retrieving Term vectors

2013-03-19 Thread Koji Sekiguchi

Hi Sarita,

I've not dug into your code in detail, but my first impression is that
you are not storing term positions?

FieldType fieldType = new FieldType();
IndexOptions indexOptions = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
fieldType.setIndexOptions(indexOptions);
fieldType.setIndexed(true);
fieldType.setStoreTermVectors(true);
fieldType.setStored(true);
Document doc = new Document();
doc.add(new Field("content", "one quick brown fox jumped over one lazy dog", fieldType));

I think you need:

fieldType.setStoreTermVectorPositions(true);

if you want term vector positions later.

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html



Re: Incorrect snippets using FastVectorHighlighter

2013-03-18 Thread Koji Sekiguchi

Hi Jochen,

There is a restriction in FVH. FVH cannot deal with variable gram size.
That is, minGramSize == maxGramSize in your NGramFilterFactory setting.
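For example, this fixed-gram setting works with FVH:

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>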

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


(13/03/18 22:17), Jochen Just wrote:


Hi list,

I have the following field type defined in my schema.xml in order to be able to
do in-word search.

<fieldType name="string_parts_back" class="solr.TextField" positionIncrementGap="100"
  omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="1000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Searching itself works as expected, though highlighting causes me headaches.
At first I did not use the FastVectorHighlighter, which meant highlighting did
not work at all for fields of this type. Since I'm using the 
FastVectorHighlighter
most of the time highlighting works, sometimes it doesn't.

Given I have a document containing the word 'Superkalifragilistischexpialligetisch'
and I search for 'uperkalifragilistische', I would expect as result
'S<em>uperkalifragilistische</em>xpiallegetisch'
but it is 'S<em>uperkalifragilist</em>ischexpialligetisch'. So there is 'ische'
missing in the highlighted part.

Sadly, I am not able to create a simple setup to reproduce this, but it only 
happens in our in-house live system.
Though if I remove some fields from the qf attribute of the edismax parser in
solrconfig.xml, it stops behaving like that.
Some of those removed fields have the fieldType string_parts_back.

Does any one have a clue, what's going on?

Thanks in advance,
Jochen


--
Jochen Just   Fon:   (++49) 711/28 07 57-193
avono AG  Mobil: (++49) 172/73 85 387
Breite Straße 2   Mail:  jochen.j...@avono.de
70173 Stuttgart   WWW:   http://www.avono.de





Re: Incorrect snippets using FastVectorHighlighter

2013-03-18 Thread Koji Sekiguchi

So just to be clear:
There is no possibility to highlight results, if I use variable gram size.
Neither the original highlighter nor FVH do the job.
Or am I missing something?


I don't know whether the latest original highlighter still has such a restriction
today, but when FVH came in 2.9, at that time, the original highlighter couldn't
deal with an n-gram field if n > 1, because the (k)-th term's end offset can be
larger than the (k+1)-th term's start offset.


Btw, does any documentation exist on how the FVH works?


See package summary:

http://lucene.apache.org/core/4_2_0/highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Confusion over Solr highlight hl.q parameter

2013-03-16 Thread Koji Sekiguchi
(13/03/16 4:08), Van Tassell, Kristian wrote:
 Hello everyone,
 
If I search for a term "baz" and tell it to highlight it, it highlights just
fine.

If, however, I search for "foo bar" using the q parameter, which appears in
that same document/same field, and use the hl.q parameter to search and
highlight "baz", I get no highlighting results for "baz".
 
 ?q=パーツにおける機能強化
 qf=text_ja_JP
 defType=edismax
 hl=true
hl.simple.pre=<em>
hl.simple.post=</em>
 hl.fl=text_ja_JP
 
 The above highlights query term just fine.
 
 ?q=1234
 hl.q=パーツにおける機能強化
 qf=id
 defType=edismax
 hl=true
hl.simple.pre=<em>
hl.simple.post=</em>
 hl.fl=text_ja_JP
 
 This one returns zero highlighting hits.

I'm just guessing, Solr highlighter tries to highlight パーツにおける機能強化 in your
default search field? Can you try hl.q=text_ja_JP:パーツにおける機能強化 .

koji
-- 
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: how to overrride pre and post tags when usefastVectorHighlighter is set to true

2013-02-23 Thread Koji Sekiguchi

Hi Alex,

(13/02/23 10:53), alx...@aim.com wrote:

Hello,

I was unable to change pre and post tags for highlighting when 
usefastVectorHighlighter is set to true. Changing default tags in 
solrconfig.xml works for standard highlighter though. I searched mailing list 
and the net with no success.
I use solr-4.1.0.


According to Wiki:

hl.simple.pre/hl.simple.post
http://wiki.apache.org/solr/HighlightingParameters#hl.simple.pre.2BAC8-hl.simple.post

... Use hl.tag.pre and hl.tag.post for FastVectorHighlighter (see example under 
hl.fragmentsBuilder)

And solrconfig.xml in example:

  <!-- multi-colored tag FragmentsBuilder -->
  <fragmentsBuilder name="colored"
    class="solr.highlight.ScoreOrderFragmentsBuilder">
    <lst name="defaults">
      <str name="hl.tag.pre"><![CDATA[
        <b style="background:yellow">,<b style="background:lawgreen">,
        <b style="background:aquamarine">,<b style="background:magenta">,
        <b style="background:palegreen">,<b style="background:coral">,
        <b style="background:wheat">,<b style="background:khaki">,
        <b style="background:lime">,<b style="background:deepskyblue">]]></str>
      <str name="hl.tag.post"><![CDATA[</b>]]></str>
    </lst>
  </fragmentsBuilder>

If you don't use multi-colored tag, you can simply set:

  <fragmentsBuilder name="simpletag"
    class="solr.highlight.ScoreOrderFragmentsBuilder">
    <lst name="defaults">
      <str name="hl.tag.pre"><![CDATA[<b>]]></str>
      <str name="hl.tag.post"><![CDATA[</b>]]></str>
    </lst>
  </fragmentsBuilder>

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Order by hl.snippets count

2012-11-19 Thread Koji Sekiguchi

(12/11/20 1:50), Gabriel Croitoru wrote:

Hello,
I'm using Solr 1.3 with the http://wiki.apache.org/solr/HighlightingParameters
options.
The client just asked us to change the order from the default score to the
number of hl.snippets per document.

Is this possible from Solr configuration (without implementing a custom
scoring algorithm)?


I don't think it is possible.

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Patch Needed for Issue Solr-3790

2012-11-09 Thread Koji Sekiguchi

(12/11/09 19:20), mechravi25 wrote:

Hi All,

I'm using Solr 3.6.1. For the issue given in the following URL, there
is no patch file provided:

https://issues.apache.org/jira/browse/SOLR-3790

Can you tell me if there is patch file for the same?

Also, we noticed that the below URL had the changes that had to be done to
resolve this issue. In this, only one file, SolrIndexSearcher.java, was
changed, by including

synchronized(this) above the line
'if (storedHighlightFieldNames == null) {' inside the
'public Collection<String> getStoredHighlightFieldNames()' method:

http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java?r1=1229401&r2=1231606&diff_format=h

Can anyone confirm me if this is the only change to resolve the same?


Yes, it is the only change to resolve the problem, I think.
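For illustration, the shape of that change is a lazy initialization wrapped in a
synchronized block; a stripped-down, self-contained sketch (not the actual
SolrIndexSearcher code):

import java.util.*;

public class LazyFieldNames {
  private Collection<String> storedHighlightFieldNames;

  public Collection<String> getStoredHighlightFieldNames() {
    synchronized (this) {
      if (storedHighlightFieldNames == null) {
        // in SolrIndexSearcher this list is populated from the index's stored fields
        storedHighlightFieldNames = new LinkedList<String>();
      }
    }
    return storedHighlightFieldNames;
  }
}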

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Solr and OpenNLP integration

2012-10-11 Thread Koji Sekiguchi

(12/10/11 20:40), ahmed wrote:

Hi, thanks for the reply.
In fact I tried this tutorial, but when I execute 'ant compile' I have a
problem that a class is not found despite the classes being there. I don't
know what the problem is.



I think attaching the error you got would help us understand your problem.
Also, before that, what do you want to do with the Solr and OpenNLP integration?

koji
--
http://soleami.com/blog/starting-lab-work.html


Re: Regarding delta-import and full-import

2012-09-27 Thread Koji Sekiguchi

(12/09/27 22:45), darshan wrote:

Hi All,

Can anyone refer me to a few blogs that explain both
imports in a little more detail and with examples?



Thanks,

Darshan




Asking Google, I got:

http://www.arunchinnachamy.com/apache-solr-mysql-data-import/
http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx
http://pooteeweet.org/blog/1827

:

koji
--
http://soleami.com/blog/starting-lab-work.html


Re: solr binary protocol

2012-09-26 Thread Koji Sekiguchi

(12/09/27 9:29), Radim Kolar wrote:

Is it possible to use the Solr binary protocol instead of XML for sending data TO
Solr? I know that it can be used in the Solr reply.



Have you looked at javabin?

http://wiki.apache.org/solr/javabin
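On the client side, SolrJ can send updates in the javabin format by swapping in
the binary request writer; a minimal sketch with a 4.x-style SolrJ client (the
core URL and document are illustrative):

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JavabinUpdates {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    server.setRequestWriter(new BinaryRequestWriter()); // send updates as javabin instead of XML
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    server.add(doc);
    server.commit();
  }
}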

koji
--
http://soleami.com/blog/starting-lab-work.html


Re: Broken highlight truncation for hl.alternateField

2012-09-14 Thread Koji Sekiguchi

Hi Arcadius,

I think it is a feature. If no matching terms are found in the hl.fl fields, it
triggers the hl.alternateField function, and if you set
hl.maxAlternateFieldLength=[LENGTH], the highlighter extracts the first [LENGTH]
characters of the stored data of the hl.fl field. As this is a common feature of
both the highlighter and FVH, it doesn't take hl.bs.type into account (it is a
special param for the boundary scanner).

For now, implement boundary scanning in your client if you want.

koji
--
http://soleami.com/blog/starting-lab-work.html

(12/09/15 0:13), Arcadius Ahouansou wrote:

Hello.

I am using the fastVectorHighlighter in Solr 3.5 to highlight and truncate
the summary of my results.

The standard breakIterator is being used with hl.bs.type = WORD as
per http://lucidworks.lucidimagination.com/display/solr/Highlighting

Search is being performed on the document title and summary.

In my edismax requesthandler, I have as default:

str name=hl.useFastVectorHighlightertrue/str
str name=hl.flsummary/str
str name=f.summary.hl.alternateFieldsummary/str


A simplified query looks like this:

/solr/search?q=helphl=truef.summary.hl.fragsize=250f.summary.hl.maxAlternateFieldLength=250

So, I am truncating only the summary.

1- When a search term is found in the description, everything works well as
expected
and the summary is truncated and contains whole words only (the
breakIterator is being applied properly).

2- However, when there is no match in the summary, then
the f.summary.hl.alternateField kicks in and the summary returned is often
truncated in the middle of a word (i.e. we may get "peo" instead of
"people").
This lets me suppose that the breakIterator is not applied to
f.summary.hl.alternateField.


My question is: how to return full word truncation when summary is fetched
from  f.summary.hl.alternateField ? (i.e no match in summary)
Or is there any other way I could get proper truncation when there is no
match in the summary?


  Thank you very much.

Arcadius






Re: Doubts in PathHierarchyTokenizer

2012-09-12 Thread Koji Sekiguchi

Use delimiter option instead of pattern for PathHierarchyTokenizerFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PathHierarchyTokenizerFactory
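That is, the index-time tokenizer line in the fieldType below would become:

<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="|"/>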

koji
--
http://soleami.com/blog/starting-lab-work.html

(12/09/12 22:22), mechravi25 wrote:

Hi,

Im Using Solr 3.6.1 version and I have a field which is having values like

A|B|C
B|C|D|EE
A|C|B
A|B|D
..etc..

So, When I search for A|B, I should get documents starting with
A and A|B

To implement this, I've used PathHierarchyTokenizer for the above field as


<fieldType name="filep" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" pattern="|"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

But, when I use the Solr analysis page to check if it's being split on the
pipe symbol (|) at index time, I see that it's being taken as the entire
token and it's not getting split on the delimiter (i.e. the search is done
only for A|B in the above case).

I also tried using \| as the delimiter, but that's not working either.

Am I missing anything here? Or will the PathHierarchyTokenizer not accept the pipe
symbol (|) as a delimiter?
Can anyone guide me on this?

Thanks a lot



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Doubts-in-PathHierarchyTokenizer-tp4007216.html
Sent from the Solr - User mailing list archive at Nabble.com.







Re: PathHierarchyTokenizerFactory behavior

2012-07-09 Thread Koji Sekiguchi

(12/07/09 19:41), Alok Bhandari wrote:

Hello,

this is how the field is declared in schema.xml

<fieldType name="text_path" class="solr.TextField" stored="true"
  indexed="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

When I query this field with the input
M:/Users/User/AppData/Local/test/abc.txt,
it searches for documents containing any of the generated tokens (M, Users,
User, etc.), but I want to search for the exact file with the given input as a
value. Please let me know how I can achieve that. I am using Solr 3.6. Thanks.


Can you try KeywordTokenizerFactory instead of PathHierarchyTokenizerFactory?

koji
--
http://soleami.com/blog/starting-lab-work.html




Re: using Carrot2 custom ITokenizerFactory

2012-05-21 Thread Koji Sekiguchi

My problem was gone. Thanks Staszek and Dawid!

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


(12/05/21 18:11), Stanislaw Osinski wrote:

Hi Koji,

Dawid came up with a simple fix for this, it's committed to trunk and 3.6
branch.

Staszek


using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Koji Sekiguchi
Hello,

As I'd like to use custom ITokenizerFactory, I set the following Carrot2 key
in solrconfig.xml:

  <searchComponent name="clustering"
    enable="${solr.clustering.enabled:true}"
    class="solr.clustering.ClusteringComponent">
    <lst name="engine">
      <str name="name">default</str>
      :
      <str name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory</str>
    </lst>
  </searchComponent>

But seems that CarrotClusteringEngine overwrites it with 
LuceneCarrot2TokenizerFactory
in init() method:

BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
.stemmerFactory(LuceneCarrot2StemmerFactory.class)
.tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
.lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);

Am I missing something?

koji
-- 
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Koji Sekiguchi
Hi Staszek,

I'll wait your fix. Thank you!

Koji Sekiguchi from iPad2

On 2012/05/20, at 18:18, Stanislaw Osinski stanis...@osinski.name wrote:

 Hi Koji,
 
 You're right, the current code overwrites the custom tokenizer though it
 shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular
 dependencies (Carrot2 default tokenizer depends on Lucene), but it
 shouldn't be an issue with custom tokenizers.
 
 I'll try to commit a fix later today. Meanwhile, if you have a chance to
 recompile the code, a temporary solution would be to hardcode your
 tokenizer class into the fragment you pasted:
 
   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
   .tokenizerFactory(YourCustomTokenizer.class)
   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
 
 Staszek
 
 On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi k...@r.email.ne.jp wrote:
 
 Hello,
 
 As I'd like to use custom ITokenizerFactory, I set the following Carrot2
 key
 in solrconfig.xml:
 
 <searchComponent name="clustering"
   enable="${solr.clustering.enabled:true}"
   class="solr.clustering.ClusteringComponent">
    <lst name="engine">
      <str name="name">default</str>
      :
      <str name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory</str>
    </lst>
 </searchComponent>
 
 But seems that CarrotClusteringEngine overwrites it with
 LuceneCarrot2TokenizerFactory
 in init() method:
 
   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
   .tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
 
 Am I missing something?
 
 koji
 --
 Query Log Visualizer for Apache Solr
 http://soleami.com/
 


Re: Newbie with Carrot2?

2012-05-20 Thread Koji Sekiguchi

(12/05/20 23:21), Xue-Feng Yang wrote:

Hi Staszek,

I haven't found a way for inputting data into Solr in the wiki. Does that mean
docs can be input in the normal Solr way after configuration, for example via DIH
or SolrJ?

Thanks,

Xue-Feng


Right, because Carrot2 clustering is for search time.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Koji Sekiguchi
at org.carrot2.util.attribute.AttributeBinder.set(AttributeBinder.java:129)
at org.carrot2.core.ControllerUtils.init(ControllerUtils.java:50)
	at 
org.carrot2.core.PoolingProcessingComponentManager$ComponentInstantiationListener.objectInstantiated(PoolingProcessingComponentManager.java:189)

... 30 more
Caused by: java.lang.IllegalArgumentException: Can not set 
org.carrot2.text.linguistic.ITokenizerFactory field 
org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline.tokenizerFactory to java.lang.String
	at 
sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:146)
	at 
sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:150)

at 
sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:63)
at java.lang.reflect.Field.set(Field.java:657)
	at 
org.carrot2.util.attribute.AttributeBinder$AttributeBinderActionBind.performAction(AttributeBinder.java:610)

... 37 more


I should dig in, but if you have any clue, it would be appreciated. I'm using 
3.6 branch.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/

(12/05/20 21:11), Stanislaw Osinski wrote:

Hi Koji,

It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with
this, let me know.

Staszek

On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:


Hi Staszek,

I'll wait your fix. Thank you!

Koji Sekiguchi from iPad2

On 2012/05/20, at 18:18, Stanislaw Osinski stanis...@osinski.name wrote:


Hi Koji,

You're right, the current code overwrites the custom tokenizer though it
shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular
dependencies (Carrot2 default tokenizer depends on Lucene), but it
shouldn't be an issue with custom tokenizers.

I'll try to commit a fix later today. Meanwhile, if you have a chance to
recompile the code, a temporary solution would be to hardcode your
tokenizer class into the fragment you pasted:

   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
   .tokenizerFactory(YourCustomTokenizer.class)
   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);

Staszek

On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi k...@r.email.ne.jp wrote:



Hello,

As I'd like to use custom ITokenizerFactory, I set the following Carrot2
key
in solrconfig.xml:

<searchComponent name="clustering"
   enable="${solr.clustering.enabled:true}"
   class="solr.clustering.ClusteringComponent">
    <lst name="engine">
      <str name="name">default</str>
      :
      <str name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory</str>
    </lst>
</searchComponent>

But seems that CarrotClusteringEngine overwrites it with
LuceneCarrot2TokenizerFactory
in init() method:

   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
   .tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);

Am I missing something?

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/









Re: Is it possible to limit the bandwidth of replication

2012-05-07 Thread Koji Sekiguchi

(12/05/07 15:38), James wrote:

I notice the index replication utilize the full bandwidth. So the normal query 
stalled. Is there any method to control the bandwidth of replication?



I don't know the status of Java-based replication, but there is a bwlimit
option for script-based replication that addresses your problem.

https://issues.apache.org/jira/browse/SOLR-2099

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: Solr 3.5 - Elevate.xml causing issues when placed under /data directory

2012-05-02 Thread Koji Sekiguchi

(12/05/03 1:39), Noordeen, Roxy wrote:

Hello,
I just started using elevation for solr. I am on solr 3.5, running with Drupal 
7, Linux.

1. I updated my solrconfig.xml
from
<dataDir>${solr.data.dir:./solr/data}</dataDir>

To
<dataDir>/usr/local/tomcat2/data/solr/dev_d7/data</dataDir>

2. I placed my elevate.xml in my Solr data directory. Based on forum answers,
I thought placing elevate.xml under the data directory would pick up my latest
changes. I restarted Tomcat.

3. When I placed my elevate.xml under the conf directory, elevation was working
with the URL:

http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name

But when I moved it to the data directory, I am not seeing any results.

NOTE: I can see in catalina.out that Solr reads the file from the data
directory. I tried to give invalid entries; I noticed Solr errors parsing
elevate.xml from the data directory. I even tried to send some documents to
index, thinking a commit might help to read the elevate config file. But nothing helped.

I don't understand why the below URL does not work anymore. There are no errors
in the log files.

http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name


Any help on this topic is appreciated.


Hi Noordeen,

What do you mean by "I am not seeing any results"? Is it no docs in the response
(numFound=0)?

And have you tried the original ${solr.data.dir:./solr/data} for the dataDir?
Isn't it working for you too?

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: How to integrate sen and lucene-ja in SOLR 3.x

2012-05-01 Thread Koji Sekiguchi

(12/05/02 1:47), Shanmugavel SRD wrote:

Hi,
   Can anyone help me on how to integrate sen and lucene-ja.jar in SOLR 3.4
or 3.5 or 3.6 version?


I think lucene-ja.jar no longer exists on the Internet and doesn't work with
Lucene/Solr 3.x because the interface doesn't match (lucene-ja doesn't know
about AttributeSource).

Use lucene-gosen which is the descendant project of sen/lucene-ja instead.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: Solr: Highlighting word parts in excerpt does not work

2012-04-05 Thread Koji Sekiguchi

(12/04/05 15:34), Thomas Werthmüller wrote:

Hi

I configured Solr so that word parts are also found. When I search "Monday"
or "Mond", the right document is found. This is done with the following
configuration in the schema.xml:

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>

Now, when I add hl=true to the query string, the excerpt for "Monday" looks
good and the word is highlighted. When I search only with "Mond", the
document is found but no excerpt is returned because the query string is not
the whole word.

I hope someone can give me a hint so that excerpts are also returned for word
parts.

Thanks!
Thomas


Hi Thomas,

Highlighter doesn't support N-gram fields, I think. (Or does it support N-gram
fields recently?) FastVectorHighlighter does support such fields, but with a
fixed gram size only, e.g. minGramSize=3 maxGramSize=3.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: Why my highlights are wrong(one character offset)?

2012-03-27 Thread Koji Sekiguchi

What does your sequence field look like in schema.xml (fieldType and field)?
And what version are you using?

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/

(12/03/27 13:06), neosky wrote:

all of my highlights has one character mistake in the offset,some fragments
from my response. Thanks!

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">259</int>
<lst name="params">
<str name="explainOther"/>
<str name="indent">on</str>
<str name="hl.fl">sequence</str>
<str name="wt"/>
<str name="hl">true</str>
<str name="rows">10</str>
<str name="version">2.2</str>
<str name="fl">*,score</str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="start">0</str>
<str name="q">sequence:NGNFN</str>
<str name="qt"/>
<str name="fq"/>
</lst>
</lst>
<lst name="highlighting">
<lst name="B9SUS0">
<arr name="sequence">
<str>TSQSEL<em>SNGNF</em>NRRPKIELSNFDGNHPKTWIRKC</str>
</arr>
</lst>
<lst name="Q01GW2">
<arr name="sequence">
<str>GENTRE<em>RNGNF</em>NSLTRERSFAELENHPPKVRRNGSEG</str>
</arr>
</lst>
<lst name="C5L0V0">
<arr name="sequence">
<str>EGRYPC<em>NNGNF</em>NLTTGRCVCEKNYVHLIYEDRI</str>
</arr>
</lst>
<lst name="C4JX93">
<arr name="sequence">
<str>YAEENY<em>INGNF</em>NEEPY</str>
</arr>
</lst>
<lst name="D7CK80">
<arr name="sequence">
<str>KEVADD<em>CNGNF</em>NQPTGVRI</str>
</arr>
</lst>
</lst>
</response>
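
While waiting for the schema details, one thing worth confirming:
hl.useFastVectorHighlighter=true only works on fields indexed with full term
vector data, so the sequence field would need to be declared something like
the sketch below (the type name here is a guess, not taken from the poster's
schema):

<field name="sequence" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

Since every fragment above is shifted by exactly one character, the offsets
stored at index time are the likely suspect: a CharFilter or custom tokenizer
in the field's analyzer that reports startOffset one position early would
produce exactly this symptom.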






Re: Reporting tools

2012-03-09 Thread Koji Sekiguchi

(12/03/09 12:35), Donald Organ wrote:

Are there any reporting tools out there, so I can analyze search term
frequency, filter frequency, etc.?


You may be interested in:

Free Query Log Visualizer for Apache Solr
http://soleami.com/

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: Help with Synonyms

2012-03-05 Thread Koji Sekiguchi

(12/03/06 0:11), Donald Organ wrote:

Try to remove tokenizerFactory="KeywordTokenizerFactory" in your synonym
filter definition, because I think you would want the synonym settings in
synonyms.txt to be tokenized as floor / locker => storage / locker. But if
you set it to KeywordTokenizer, it will be a map of "floor locker" =>
"storage locker", and as you are using WhitespaceTokenizer for your
<tokenizer/> in <analyzer>, then if you try to index "floor locker", it will
be floor/locker (not "floor locker"), and as a result, it will not match
your synonym map.

Aside, I recommend that you set the <charFilter/> -> <tokenizer/> -> <filter/>
chain in the natural order in <analyzer>, though if those are wrong it won't
be the cause of the problem at all.




OK so I have updated my schema.xml to the following:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           omitNorms="false">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  ...

I am still getting results for "storage locker" and no results for
"floor locker".

synonyms.txt still looks like this:

floor locker => storage locker


Hi Donald,

Do you use the same SynonymFilter setting in the query analyzer part
(<analyzer type="query">)?

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
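
For reference, the "query analyzer part" Koji mentions is a second <analyzer>
block inside the same fieldType. A sketch, assuming the same synonyms.txt
(though, as the rest of the thread shows, Koji ends up recommending
index-time expansion instead):

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>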


Re: Help with Synonyms

2012-03-05 Thread Koji Sekiguchi

(12/03/06 11:07), Donald Organ wrote:

No, I do synonyms at index time.


:

I am still getting results for "storage locker" and no results for
"floor locker".

synonyms.txt still looks like this:

floor locker => storage locker


So that's the cause of the problem. Due to the definition
floor locker => storage locker at index-time analysis, you got
storage / locker in your index, and no floor terms in your index at all.
In general, if you use the "=>" method in your synonyms.txt, you should
apply the same rule at both index and query time.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: Help with Synonyms

2012-03-05 Thread Koji Sekiguchi

(12/03/06 11:23), Donald Organ wrote:

OK, so do I need to use a different format in my synonyms.txt file in order
to do this at index time?



Right, if you want to apply synonym rules at index time only.
Use "," like this:

floor locker, storage locker

And don't forget to set expand="true" in your index-time synonym definition.
With this, if you have "floor locker" in your document, it will be expanded
in the index to not only "floor locker" but also "storage locker", and then
you can search for the document with either q="floor locker" or
q="storage locker".

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
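
Putting Koji's two suggestions together, Donald's index-time chain would
become something like this sketch (abbreviated, with synonyms.txt rewritten
in the comma form):

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- synonyms.txt now reads: floor locker, storage locker -->
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  ...
</analyzer>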


Re: nutch log

2012-03-03 Thread Koji Sekiguchi

(12/03/03 20:32), alessio crisantemi wrote:

this is my nutch log after I configured it for the solr index:


:

org.apache.solr.common.SolrException: Internal Server Error
Internal Server Error
request: http://localhost:8983/solr/update?wt=javabin&version=2
  at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)

:

suggestions?
thanks
alessio

Hi alessio,

I have no ideas for nutch, but I think you can look for the cause of the 
internal server
error in Solr log, not in nutch log.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: nutch log

2012-03-03 Thread Koji Sekiguchi

(12/03/04 0:09), alessio crisantemi wrote:

That is true.
This is the Solr problem:
mar 03, 2012 12:08:04 PM org.apache.solr.common.SolrException log
Grave: org.apache.solr.common.SolrException: invalid boolean value:


Solr said that there was an erroneous boolean value in your solrconfig.xml.
Check the values of <bool>...</bool> of your solr plugins in solrconfig.xml.
Those should be one of true/false/on/off/...

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
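
To illustrate, here is the kind of setting the parser is complaining about.
The handler definition below is hypothetical; only the <bool> values matter:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="hl">true</bool>   <!-- valid -->
    <bool name="facet">on</bool>  <!-- also valid -->
    <!-- an empty or misspelled value, e.g. <bool name="hl"></bool>,
         fails with "invalid boolean value:" as in the log above -->
  </lst>
</requestHandler>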


Re: nutch log

2012-03-03 Thread Koji Sekiguchi

It is not a Solr error. Consult the nutch/hadoop mailing list.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/

(12/03/04 2:38), alessio crisantemi wrote:

now,
  I solve the boolean problem.

but my indexing don't works now also..

But this time, I have no error in the tomcat log and no error in the nutch
log. I see only this output in the cygwin window:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist:
file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120303171628/parse_data

at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)

at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)

at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)

at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)

at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)

at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)

at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)

at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)


Why does this happen, in your opinion?
Thanks again,
alessio

On 3 March 2012, at 16:43, Koji Sekiguchi k...@r.email.ne.jp wrote:


(12/03/04 0:09), alessio crisantemi wrote:


That is true.
This is the Solr problem:
mar 03, 2012 12:08:04 PM org.apache.solr.common.SolrException log
Grave: org.apache.solr.common.SolrException: invalid boolean value:



Solr said that there was an erroneous boolean value in your solrconfig.xml.
Check the values of <bool>...</bool> of your solr plugins in
solrconfig.xml.
Those should be one of true/false/on/off/...


koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/





Re: Help with Synonyms

2012-03-02 Thread Koji Sekiguchi

(12/03/03 1:39), Donald Organ wrote:

I am trying to get synonyms working correctly. I want to map "floor locker"
to "storage locker".

Currently, searching for "storage locker" produces results, whereas searching
for "floor locker" does not produce any results.
I have the following setup for index-time synonyms:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           omitNorms="false">
  <analyzer type="index">
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"
            tokenizerFactory="KeywordTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>

And my synonyms.txt looks like this:

floor locker => storage locker

What am I doing wrong?


Hi Donald,

Try to remove tokenizerFactory="KeywordTokenizerFactory" in your synonym
filter definition, because I think you would want the synonym settings in
synonyms.txt to be tokenized as floor / locker => storage / locker. But if
you set it to KeywordTokenizer, it will be a map of "floor locker" =>
"storage locker", and as you are using WhitespaceTokenizer for your
<tokenizer/> in <analyzer>, then if you try to index "floor locker", it will
be floor/locker (not "floor locker"), and as a result, it will not match
your synonym map.

Aside, I recommend that you set the <charFilter/> -> <tokenizer/> -> <filter/>
chain in the natural order in <analyzer>, though if those are wrong it won't
be the cause of the problem at all.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
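
To make the tokenization point concrete, here are the two variants of the
filter side by side. This is a sketch; only the way the entries in
synonyms.txt are tokenized differs:

<!-- default: entries are split on whitespace, so "floor locker" becomes
     the token sequence floor / locker, which a WhitespaceTokenizer
     analysis chain can actually produce at index time -->
<filter class="solr.SynonymFilterFactory"
        synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

<!-- with KeywordTokenizerFactory: each entry stays a single token,
     "floor locker", which a whitespace-tokenized stream never emits,
     so the mapping never matches -->
<filter class="solr.SynonymFilterFactory"
        synonyms="synonyms.txt" ignoreCase="true" expand="true"
        tokenizerFactory="KeywordTokenizerFactory"/>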

