Re: ranking on Multivalued fields

2008-03-11 Thread Tobias Lohr
What you probably want to achieve is displaying only docs in a certain 
category (maybe filtered) ordered by descending score in the context of 
exactly this category, right?


Well, you could work around this by creating a category-specific score 
field for every category, following the scheme cat-X-score, where X is 
the identifier of each of your categories. Then, when receiving a request 
for a category, you programmatically build the sort-by 
condition for the field cat-Y-score, where Y is the id of the 
category the request is for.
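
A minimal sketch of building that sort condition, in Python. The function name and the request parameter names ("q", "sort") are assumptions for illustration; how the sort is actually passed depends on the Solr version in use. Only the cat-X-score naming scheme comes from the message above.

```python
def category_sort_params(category_id):
    """Build request parameters that sort by the per-category score field.

    Hypothetical helper: the field name follows the cat-X-score scheme
    described above; the parameter names are assumptions.
    """
    sort_field = f"cat-{category_id}-score"
    return {
        "q": f"cat:{category_id}",      # restrict to the requested category
        "sort": f"{sort_field} desc",   # order by that category's own score
    }


params = category_sort_params("42")
print(params)
```

Each document would then need one cat-X-score value per category it belongs to, written at index time.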


*tobi*

Umar Shah wrote:

Hi Otis,

thanks for the reply,

consider a multivalued field named cat:

<doc>
  -- other fields

  <field name="cat"> val 1 <field name="catrank"> score1 </field> </field>
  <field name="cat"> val 2 <field name="catrank"> score2 </field> </field>
  <field name="cat"> val 3 <field name="catrank"> score3 </field> </field>
  ..

  -- other fields
</doc>

the query I have to use is
q= cat:query-text; sort catrank desc

i.e. get all the documents
WITH field cat HAVING query-text
AND order by catrank desc

On 3/8/08, Otis Gospodnetic [EMAIL PROTECTED] wrote:
  

Umar,

I'm not sure what you mean by a subfield, can you explain please?

As for your second question, just add category:X to your query and you'll
get matches ordered/ranked by score by default.

Otis


--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
From: Umar Shah [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, March 7, 2008 1:17:35 AM
Subject: ranking on Multivalued fields

Hi,

I have a problem where I want to rank multivalued fields.

Suppose a multivalued field category has an associated subfield score.
First, is it possible to have a subfield in a multivalued field?
Second, I want to get the documents ranked by the highest score for, say,
category:X

thanks
Umar Shah







  




Re: What is default Date time format in Solr

2008-03-11 Thread Mahesh Udupa
Thanks Chris,

My index creation was wrong ;) (I was using the 12-hour format)
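
For illustration only (this is a sketch in Python, not the indexing code from this thread): the difference between a 24-hour and a 12-hour format pattern is enough to produce exactly the 22:20 versus 10:20 mismatch seen in the queries quoted below.

```python
from datetime import datetime

# A timestamp late in the evening, as in the queries quoted below.
ts = datetime(2007, 12, 31, 22, 20, 0)

# %H is the 24-hour clock, which is what Solr date fields expect.
correct = ts.strftime("%Y-%m-%dT%H:%M:%SZ")

# %I is the 12-hour clock; without an AM/PM marker the evening hour is lost.
wrong = ts.strftime("%Y-%m-%dT%I:%M:%SZ")

print(correct)
print(wrong)
```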

Thanks for your support
-kmu

On Sat, Mar 8, 2008 at 1:35 AM, Chris Hostetter [EMAIL PROTECTED]
wrote:


 : I heard Solr Date time format is 24 hours.

 that is correct.

 : emf.artist:[2007-12-31T22:20:00Z TO  2007-12-31T22:39:00Z]
 :
 : I am not able to get the content what I expected.
 :
 : But, I tried with following query:-
 :
 : emf.artist:[2007-12-31T10:20:00Z TO  2007-12-31T10:39:00Z]

 Is your emf.artist field stored?
 If so what value do you see in the field when you do that second query and
 get the results you are looking for?  if they don't match what you think
 they should be, then the code you have reading dates from your index and
 writing them to Solr isn't doing what you think it's doing.




 -Hoss




Re: Accented search

2008-03-11 Thread Peter Cline
I'm not sure about a way to boost scores in this case, but you can 
achieve the basic matching by applying a filter to the index and the 
queries.  The ISOLatin1AccentFilter seems like it may work for you, 
though I'm not entirely certain it will cover all the accented 
characters you need.


My approach has been to write new filters, one to normalize the unicode 
into the decomposed version, then one to manually strip out all of the 
add-on characters (with decimal codepoint greater than 256).  I don't 
know if this will always work, but it's worked well for me so far.
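
A rough Python sketch of that two-step idea (decompose, then drop the add-on characters). The function name is made up for illustration, and this is not the Lucene filter code itself. Note the caveat that follows from the codepoint threshold: letters that do not decompose, such as đ, are dropped entirely rather than mapped to a base letter.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop every character with codepoint above 256."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks start at U+0300, so they all fall above the threshold.
    return "".join(ch for ch in decomposed if ord(ch) <= 256)


print(strip_accents("Lập Trình Viên"))
print(strip_accents("préférence"))
```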


I would test out adding a <filter class="solr.ISOLatin1AccentFilterFactory"/> 
to your analyzer.  It might do the trick.  Once again, with this 
approach I'm not sure how to boost either score, so someone else may 
have better ideas.  I'm pretty new to all of this stuff.


Peter

climbingrose wrote:

Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google does with UTF-8 languages.

My requirements include:
1) Accent-insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters Lập Trình Viên, then Doc B is also matched and Lập
Trình Viên is highlighted.
  On the other hand, if the query is Lap Trinh Vien, Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters Lập Trình Viên, then Doc A should be given a higher
score than Doc B.
  if the query is Lap Trinh Vien, Doc A should be given higher score.

Any ideas guys? Thanks in advance!

  


RE: Accented search

2008-03-11 Thread Binkley, Peter
We've done this in a pre-Solr Lucene context by using the position increment: 
when a token contains accented characters, you add a stripped version of that 
token with a zero increment, so that for matching purposes the original and the 
stripped version are at the same position. Accents are not stripped from 
queries. The effect is that an accented search matches your Doc A, and an 
unaccented search matches Docs A and B. We do that after lower-casing the token.

There are some limitations: users might start to expect that they can freely 
add accents to restrict their search to accented hits, but if they don't match 
the accents exactly they won't get any hits: e.g. if a word contains two 
accented characters and the user only accents one of them in their query, they 
won't match the accented or the unaccented version. 
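
A toy Python model of the index-time side of this, to make the position-increment idea concrete. It is only a sketch: real code would be a Lucene TokenFilter, and the function names and whitespace tokenization here are assumptions.

```python
import unicodedata


def strip_accents(term):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_tokens(text):
    """Return (term, position_increment) pairs for indexing.

    An accented term is followed by its stripped form with increment 0,
    so both terms occupy the same position.
    """
    tokens = []
    for word in text.lower().split():
        word = unicodedata.normalize("NFC", word)
        tokens.append((word, 1))
        stripped = strip_accents(word)
        if stripped != word:
            tokens.append((stripped, 0))  # same position as the original
    return tokens


for term, increment in index_tokens("Lập Trình Viên"):
    print(term, increment)
```

Queries are left untouched, so an unaccented query term finds both documents while an accented one finds only the document that contains exactly that accented form.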

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]

~ The code is willing, but the data is weak. ~


-Original Message-
From: climbingrose [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 10:01 PM
To: solr-user@lucene.apache.org
Subject: Accented search

Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to hear 
some ideas about how to use Solr with those languages. Basically, I want to 
achieve what Google does with UTF-8 languages.

My requirements include:
1) Accent-insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters Lập Trình Viên, then Doc B is also matched and Lập 
Trình Viên is highlighted.
  On the other hand, if the query is Lap Trinh Vien, Doc A is also matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters Lập Trình Viên, then Doc A should be given a higher score 
than Doc B.
  if the query is Lap Trinh Vien, Doc A should be given higher score.

Any ideas guys? Thanks in advance!

--
Regards,

Cuong Hoang


schema help

2008-03-11 Thread Geoffrey Young

hi :)

I'm trying to work out a schema for our widgets.  more than just coming 
up with something I'd like something idiomatic in solr terms.  any help 
is much appreciated.  here's a similar problem space to what I'm working 
with...


let's say we're talking books.  books are written by authors and held in 
libraries.  a sister company is using lucene+compass and they seem to 
have completely different collections (or whatever the technical term is :)


  authors
  books
  libraries

so that a search for authors hits only the authors dataset.

all of the solr examples I can find don't seem to address this kind of 
data disparity.  what is the standard and idiomatic approach for solr?


for my particular data I'd want to display something like this

  author
    book in library
    book in library

on the same result page, but using a completely flat, single schema 
doesn't seem to scale very well.


collective wisdom most welcome :)

--Geoff


RE: Accented search

2008-03-11 Thread Renaud Waldura
Peter:

Very interesting. To take care of the issue you mention, could you add
multiple synonyms with progressively fewer accents? 

E.g. you'd index préférence as 4 tokens:
 préférence (unchanged)
 preférence (stripped one accent)
 préference (stripped the other accent)
 preference (stripped both accents)

Or does it yield too many tokens to be useful?
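
To get a feel for the token count, here is a small Python sketch that enumerates every partially stripped form of a word. The function names are made up for illustration. A word with n accented characters yields 2^n variants, so préférence gives 4, while a three-word Vietnamese phrase with one accented letter per word already gives 8.

```python
import itertools
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_variants(word):
    """Return every form of word with any subset of its accents stripped."""
    word = unicodedata.normalize("NFC", word)
    options = []
    for ch in word:
        plain = strip_accents(ch)
        # An accented character contributes two choices, a plain one only one.
        options.append((ch,) if plain == ch else (ch, plain))
    return {"".join(combo) for combo in itertools.product(*options)}


variants = accent_variants("préférence")
print(sorted(variants))
print(len(variants))
```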

And how does this take care of scoring? Do you get a higher score with a
closer match?


 

-Original Message-
From: Binkley, Peter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 11, 2008 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Accented search

We've done this in a pre-Solr Lucene context by using the position
increment: when a token contains accented characters, you add a stripped
version of that token with a zero increment, so that for matching purposes
the original and the stripped version are at the same position. Accents are
not stripped from queries. The effect is that an accented search matches
your Doc A, and an unaccented search matches Docs A and B. We do that after
lower-casing the token.

There are some limitations: users might start to expect that they can freely
add accents to restrict their search to accented hits, but if they don't
match the accents exactly they won't get any hits: e.g. if a word contains
two accented characters and the user only accents one of them in their
query, they won't match the accented or the unaccented version. 

Peter

Peter Binkley
Digital Initiatives Technology Librarian Information Technology Services
4-30 Cameron Library University of Alberta Libraries Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]

~ The code is willing, but the data is weak. ~






Re: Accented search

2008-03-11 Thread Walter Underwood
Generally, the accented version will have a higher IDF, so it
will score higher.
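
A small worked example of why, sketched in Python and assuming the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)) (a custom Similarity could behave differently). With the zero-increment scheme, the stripped term is indexed for every document and the accented term only for documents that really contain it, so the accented term has the lower document frequency.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Two documents: Doc A (accented title) and Doc B (unaccented title).
# The accented term occurs only in Doc A; the stripped term occurs in both.
idf_accented = idf(num_docs=2, doc_freq=1)
idf_stripped = idf(num_docs=2, doc_freq=2)

print(idf_accented, idf_stripped)
```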

wunder

On 3/11/08 8:44 AM, Renaud Waldura [EMAIL PROTECTED]
wrote:

 Peter:
 
 Very interesting. To take care of the issue you mention, could you add
 multiple synonyms with progressively fewer accents?
 
 E.g. you'd index préférence as 4 tokens:
  préférence (unchanged)
  preférence (stripped one accent)
  préference (stripped the other accent)
  preference (stripped both accents)
 
 Or does it yield too many tokens to be useful?
 
 And how does this take care of scoring? Do you get a higher score with a
 closer match?
 
 
  
 
 
 



Re: schema help

2008-03-11 Thread Otis Gospodnetic
Geoff,

I'm not sure if I understood your problem correctly, but it sounds like you 
want your search to be restricted to authors, but then you want to list all of 
his/her books when displaying results.  The easiest thing to do would be to 
create an index where each row/Document has the author name, the book title, 
etc.  For each author-matching Document you'd pull his/her books out of the 
result set.  Yes, this means the author name would be denormalized in 
RDBMS-speak.  Another option is not to index/store book titles, but rather have 
only an author index to search against.  The book data (mapped to author 
identities) would then be pulled from an external source (e.g. RDBMS: select 
title from books where author_id in (1,2,3)) at search results display time.
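
A sketch of the first option's display step, in Python. The field names "author" and "title" and the sample documents are hypothetical. Each result document is one denormalized author/book row, and the client groups them for display.

```python
from collections import defaultdict


def group_books_by_author(docs):
    """Group denormalized (author, title) result documents by author."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc["author"]].append(doc["title"])
    return dict(grouped)


# Hypothetical result set: the author name repeats on every book row.
results = [
    {"author": "Jane Doe", "title": "How To Garden"},
    {"author": "Jane Doe", "title": "Fields Of Maine"},
    {"author": "John Roe", "title": "Another Book"},
]

for author, titles in group_books_by_author(results).items():
    print(author)
    for title in titles:
        print("   ", title)
```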

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Geoffrey Young [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, March 11, 2008 12:17:32 PM
Subject: schema help

hi :)

I'm trying to work out a schema for our widgets.  more than just coming 
up with something I'd like something idiomatic in solr terms.  any help 
is much appreciated.  here's a similar problem space to what I'm working 
with...

let's say we're talking books.  books are written by authors and held in 
libraries.  a sister company is using lucene+compass and they seem to 
have completely different collections (or whatever the technical term is :)

   authors
   books
   libraries

so that a search for authors hits only the authors dataset.

all of the solr examples I can find don't seem to address this kind of 
data disparity.  what is the standard and idiomatic approach for solr?

for my particular data I'd want to display something like this

   author
     book in library
     book in library

on the same result page, but using a completely flat, single schema 
doesn't seem to scale very well.

collective wisdom most welcome :)

--Geoff





Re: ranking on Multivalued fields

2008-03-11 Thread Otis Gospodnetic
Umar,

The notion of subfield does not exist in Solr (or am I living under a rock?).
Thus, <field name="cat"> val 1 <field name="catrank"> score1 </field> </field> 
doesn't really make sense.

Keep those two (cat and catrank) as two distinct fields and I think you'll have 
what you are after.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Umar Shah [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Saturday, March 8, 2008 7:03:32 AM
Subject: Re: ranking on Multivalued fields

Hi Otis,

thanks for the reply,

consider a multivalued field named cat:

<doc>
  -- other fields

  <field name="cat"> val 1 <field name="catrank"> score1 </field> </field>
  <field name="cat"> val 2 <field name="catrank"> score2 </field> </field>
  <field name="cat"> val 3 <field name="catrank"> score3 </field> </field>
  ..

  -- other fields
</doc>

the query I have to use is
q= cat:query-text; sort catrank desc

i.e. get all the documents
WITH field cat HAVING query-text
AND order by catrank desc









Re: schema help

2008-03-11 Thread Geoffrey Young



Otis Gospodnetic wrote:

Geoff,

I'm not sure if I understood your problem correctly, but it sounds
like you want your search to be restricted to authors, but then you
want to list all of his/her books when displaying results. 


that's about right.  add that I may also want to search on libraries and 
show all the books (and authors) stored there.


in real life, it's not books or authors, of course, but the parallels 
are close enough :)  in fact, the library example is a good one for 
me... or at least a network of public libraries linked together.



The
easiest thing to do would be to create an index where each
row/Document has the author name, the book title, etc.  For each
author-matching Document you'd pull his/her books out of the result
set.  Yes, this means the author name would be denormalized in
RDBMS-speak.  


I think I can live with the denormalization - it seems lucene is flat 
and very different conceptually than a database :)


the trouble I'm having is one of dimension.  an author has many, many 
attributes (name, birthdate, biography in $language, etc).  as does each 
book (title in $language, summary in $language, genre, etc).  as does 
each library (name, address, directions in $language, etc).  so an 
author with N books doesn't seem to scale very well in the flat 
representations I'm finding in all the lucene/solr docs and examples... 
at least not in some way I can wrap my head around.


part of what seemed really appealing about lucene in general was that 
you could stuff all this (unindexed) information into a document and 
retrieve it all based on some search criteria.  but it's seeming very 
difficult for me to wrap my head around the data I need to represent.



Another option is not to index/store book titles, but
rather have only an author index to search against.  The book data
(mapped to author identities) would then be pulled from an external
source (e.g. RDBMS: select title from books where author_id in
(1,2,3)) at search results display time.


eew :)  seriously, though, that's what we have now - all rdbms driven. 
if solr could only conceptually handle the initial lookup there wouldn't 
be much point.


maybe I'm thinking about this all wrong (as is to be expected :), but I 
just can't believe that nobody is using solr to represent data a bit 
more complex than the examples out there.


thanks for the feedback.

--Geoff



Otis

-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: schema help

2008-03-11 Thread Rachel McConnell
Our Solr use consists of several rather different data types, some of
which have one-to-many relationships with other types.  We don't need
to do any searching of quite the kind you describe, but I have an idea
about it, depending on what you need to do with the book data.  It is
rather hacky, but maybe you can improve it.

If you only need to present a list of books, possibly with links to
fuller data, you could do this:
* store only Authors in solr
* create a field, stored but not indexed (I may be using slightly
wrong terms here) which contains the short text representation of all
their books
* search on authors however you want and make sure you return this
field, and just display it as is

For example, if Jane Doe has written 2 books, How To Garden, and
Fields Of Maine, your special field might contain this:

<a href="link/to/How-To-Garden/">How To Garden</a> published on DATE,
describes how to garden in Jane Doe's inimitable fashion.  She goes
into great depth 

<a href="link/to/Fields-Of-Maine/">Fields of Maine</a> published on
DATE.  A brief overview of Maine's woods and fields with special
attention to wildflowers

If your 'authors' 'write' 'books' with great frequency, you'd need to
update a lot...


Another possibility is to do two searches, with this kind of
structure, which sort of mimics an RDBMS:
* everything in Solr has a field, type (book, author, library, etc).
these can be filtered on a search by search basis
* books have a field, authorId, uniquely referencing the author
* your first search will be restricted to just authors, from which you
will extract the IDs.
* your second search will be restricted to just books, whose authorId
field is exactly one of the IDs from the first search
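
The two searches could be built roughly like this (a Python sketch). The field names type and authorId come from the list above; the helper names and the choice of putting the type restriction in a filter query ("fq") are assumptions.

```python
def author_query(text):
    """First search: user text, restricted to author documents."""
    return {"q": text, "fq": "type:author"}


def books_query(author_ids):
    """Second search: books whose authorId is one of the IDs found first."""
    ids = " OR ".join(str(author_id) for author_id in author_ids)
    return {"q": f"authorId:({ids})", "fq": "type:book"}


first = author_query("doe")
second = books_query([1, 2, 3])
print(first)
print(second)
```

The IDs passed to the second helper would be read from the id field of the documents returned by the first search.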


As you have noticed, Lucene is not an RDBMS.  Searching through all
the text of all the books is more the use it was designed around; of
course the analogy might not be THAT strong with your need!

Rachel

On 3/11/08, Geoffrey Young [EMAIL PROTECTED] wrote:


  Otis Gospodnetic wrote:
   Geoff,
  
   I'm not sure if I understood your problem correctly, but it sounds
   like you want your search to be restricted to authors, but then you
   want to list all of his/her books when displaying results.


 that's about right.  add that I may also want to search on libraries and
  show all the books (and authors) stored there.

  in real life, it's not books or authors, of course, but the parallels
  are close enough :)  in fact, the library example is a good one for
  me... or at least a network of public libraries linked together.


   The
   easiest thing to do would be to create an index where each
   row/Document has the author name, the book title, etc.  For each
   author-matching Document you'd pull his/her books out of the result
   set.  Yes, this means the author name would be denormalized in
   RDBMS-speak.


 I think I can live with the denormalization - it seems lucene is flat
  and very different conceptually than a database :)

  the trouble I'm having is one of dimension.  an author has many, many
  attributes (name, birthdate, biography in $language, etc).  as does each
  book (title in $language, summary in $language, genre, etc).  as does
  each library (name, address, directions in $language, etc).  so an
  author with N books doesn't seem to scale very well in the flat
  representations I'm finding in all the lucene/solr docs and examples...
  at least not in some way I can wrap my head around.

  part of what seemed really appealing about lucene in general was that
  you could stuff all this (unindexed) information into a document and
  retrieve it all based on some search criteria.  but it's seeming very
  difficult for me to wrap my head around the data I need to represent.


   Another option is not to index/store book titles, but
   rather have only an author index to search against.  The book data
   (mapped to author identities) would then be pulled from an external
   source (e.g. RDBMS: select title from books where author_id in
   (1,2,3)) at search results display time.


 eew :)  seriously, though, that's what we have now - all rdbms driven.
  if solr could only conceptually handle the initial lookup there wouldn't
  be much point.

  maybe I'm thinking about this all wrong (as is to be expected :), but I
  just can't believe that nobody is using solr to represent data a bit
  more complex than the examples out there.

  thanks for the feedback.

  --Geoff


  

Re: Unparseable date

2008-03-11 Thread monkins

I indexed my docs with field : <field name="order_dt">1995-12-31T23:59:59.000Z</field>
But when i try to search on that field : order_dt:1995-12-31T23:59:59.000Z ,
I get an exception :
Mar 11, 2008 4:13:55 PM org.apache.solr.core.SolrException log
SEVERE: org.apache.solr.core.SolrException: Invalid Date String:'1995-12-31T23'
at org.apache.solr.schema.DateField.toInternal(DateField.java:108)
at org.apache.solr.schema.FieldType$DefaultAnalyzer$1.next(FieldType.java:298)
at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:437)
at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:78)
at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1092)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:979)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:907)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:896)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:146)

Am I missing anything ?

Thanks,
Monica.


Daniel Andersson-5 wrote:
 
 
 On Mar 5, 2008, at 11:08 PM, Chris Hostetter wrote:
 
 It's .000 not :00 ... 2008-02-12T15:02:06.000Z

 but like i said: that stack trace is odd, the time doesn't seem like it
 actually comes from any query params, it looks like it's coming from a
 previously indexed doc.  To work around this you may need to reindex
 all of your docs with those optional milliseconds.
 
 Ah, re-indexing now. Thanks for your help!
 
 / d
 
 

-- 
View this message in context: 
http://www.nabble.com/Unparseable-date-tp15854401p15994506.html
Sent from the Solr - User mailing list archive at Nabble.com.



Query Level Boosting

2008-03-11 Thread oleg_gnatovskiy

Hello. I was wondering if anyone knew a way to do query-level boosting with
SolrJ. With the HTTP client I could just do something like sku:123^2.3, which
would boost the sku query by a factor of 2.3.
-- 
View this message in context: 
http://www.nabble.com/Query-Level-Boosting-tp15995005p15995005.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Out of memory in analysis

2008-03-11 Thread Chris Hostetter

: I pasted a modest blob of text into the analysis debug slot on the admin
: app, and am rewarded with this, even with -Xmx1g.

what was the text?  what was the field/fieldtype?  what did the 
analyzers for that fieldtype look like in your schema.xml?


-Hoss



Re: return only sorted Field, but with a different Field Name

2008-03-11 Thread Chris Hostetter
: 
: For example, say I want to sort by the field '162_sortable_s' then I add a
: parameter like so 'sort=162_sortable_s.' I need to change the settings so
: that when the result set is returned from solr, it takes the values of
: '162_sortable_s' and inserts them into a separate field called 'SortedField'
: so that the return doc looks like this:

there is nothing like this in solr right now, and it doesn't seem like 
something that should be done in solr, as it would be a simple translation 
that could be done via an XSLT or some client layer code.

: How or where do I change that setting? Do I have to rewrite some part of the
: RequestHandler?

assuming you didn't want to just use an XSLT, writing your own response 
writer that subclasses XmlResponseWriter would probably be the simplest 
way to accomplish this.
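
As a rough illustration of the client-layer route (a Python sketch, not Solr code): after parsing the response, copy the value of whichever field was sorted on into a field with a fixed name. The function name and the sample documents are hypothetical; '162_sortable_s' and 'SortedField' are the names from the question.

```python
def add_sorted_field(docs, sort_field, alias="SortedField"):
    """Return copies of docs with the sort field's value duplicated under alias."""
    return [{**doc, alias: doc.get(sort_field)} for doc in docs]


# Hypothetical parsed result documents.
docs = [
    {"id": "1", "162_sortable_s": "apple"},
    {"id": "2", "162_sortable_s": "banana"},
]

for doc in add_sorted_field(docs, "162_sortable_s"):
    print(doc)
```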



-Hoss



Re: How to get incrementPositionGap value from IndexSchema ?

2008-03-11 Thread Chris Hostetter

: I am looking for a way to access the incrementPositionGap value defined for a
: field type in the schema.xml.

I think you mean positionIncrementGap

It's a property of the fieldtype in schema.xml, but internally it's 
passed to SolrAnalyzer.setPositionIncrementGap.  if you want to 
programmatically know what the positionIncrementGap is for any analyzer of 
any field or fieldtype regardless of whether or not it's a SolrAnalyzer, 
just use Analyzer.getPositionIncrementGap(String fieldName) 

ie: myFieldType.getAnalyzer().getPositionIncrementGap(myFieldName)


If you don't mind me asking:  why do you want/need this information in 
your custom code?


-Hoss



Re: Result based sorting for KWIC?

2008-03-11 Thread Chris Hostetter

: I am investigating using solr for a project that requires presentation of
: search results in a KWIC display, sorted according to either the string
: following the matches or the (reverse) of the characters previous to the
: matches.  Can this be done with Solr?  How would I go about implement this?

1) if you've got full text search, why would you even want KWIC?  

2) your description of how you'd want the results ordered is extremely 
confusing to me ... can you give a simple concrete example of some 
documents / queries / result-doclists that you would want to see?



-Hoss



Re: Unparseable date

2008-03-11 Thread Chris Hostetter
: I indexed my docs with field : <field name="order_dt">1995-12-31T23:59:59.000Z</field>
: But when i try to search on that field : order_dt:1995-12-31T23:59:59.000Z ,
: I get an exception :
: Mar 11, 2008 4:13:55 PM org.apache.solr.core.SolrException log
: SEVERE: org.apache.solr.core.SolrException: Invalid Date
: String:'1995-12-31T23'

":" is a special character for the query parser, so it either needs to be 
escaped or the date needs to be quoted...

order_dt:"1995-12-31T23:59:59.000Z"

this isn't something most people typically need to worry about, because 
dates are typically only queried using ranges...

order_dt:[1995-12-31T23:59:59.000Z TO *]
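
Both options for the exact-match case can be sketched in Python as follows. The helper names are made up, and only the colon is escaped here, since that is the character causing the error in this thread; other query parser metacharacters are not handled.

```python
def escape_colons(value):
    """Backslash-escape colons so the query parser treats them literally."""
    return value.replace(":", "\\:")


def quoted(value):
    """Wrap the value in double quotes, escaping any embedded quotes."""
    return '"' + value.replace('"', '\\"') + '"'


date = "1995-12-31T23:59:59.000Z"
escaped_query = "order_dt:" + escape_colons(date)
quoted_query = "order_dt:" + quoted(date)

print(escaped_query)
print(quoted_query)
```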



-Hoss



Re: Accented search

2008-03-11 Thread climbingrose
Hi Peter,

It looks like a very promising approach for us. I'm going to implement a
custom Tokeniser based on your suggestions and see how it goes. Thank you
all for your comments!

Cheers

On Wed, Mar 12, 2008 at 2:37 AM, Binkley, Peter [EMAIL PROTECTED]
wrote:

 We've done this in a pre-Solr Lucene context by using the position
 increment: when a token contains accented characters, you add a stripped
 version of that token with a zero increment, so that for matching purposes
 the original and the stripped version are at the same position. Accents are
 not stripped from queries. The effect is that an accented search matches
 your Doc A, and an unaccented search matches Docs A and B. We do that after
 lower-casing the token.

 There are some limitations: users might start to expect that they can
 freely add accents to restrict their search to accented hits, but if they
 don't match the accents exactly they won't get any hits: e.g. if a word
 contains two accented characters and the user only accents one of them in
 their query, they won't match the accented or the unaccented version.
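
 The zero-increment idea and its limitation can be sketched in a few lines.
 This is plain Python for illustration only, not a Lucene TokenFilter; the
 names strip_accents, analyze, index_tokens, build_positions and
 phrase_matches are hypothetical. It assumes that "stripping" means removing
 Unicode combining marks, which does not cover letters such as đ that have no
 decomposition.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def analyze(text):
    """Lower-case and NFC-normalize whitespace-separated words."""
    return [unicodedata.normalize("NFC", w.lower()) for w in text.split()]


def index_tokens(text):
    """Yield (token, position_increment) pairs for indexing.

    An accented token is followed by its stripped form with increment 0,
    so both forms occupy the same position.
    """
    for token in analyze(text):
        yield token, 1
        stripped = strip_accents(token)
        if stripped != token:
            yield stripped, 0


def build_positions(text):
    """Collect the set of tokens stored at each position."""
    positions = []
    for token, increment in index_tokens(text):
        if increment == 1:
            positions.append({token})
        else:
            positions[-1].add(token)
    return positions


def phrase_matches(doc_text, query):
    """Naive phrase match; accents are NOT stripped from the query."""
    positions = build_positions(doc_text)
    terms = analyze(query)
    for start in range(len(positions) - len(terms) + 1):
        if all(t in positions[start + i] for i, t in enumerate(terms)):
            return True
    return False


doc_a = "Lập Trình Viên"
doc_b = "Lap Trinh Vien"
print(phrase_matches(doc_a, "Lập Trình Viên"), phrase_matches(doc_b, "Lập Trình Viên"))
print(phrase_matches(doc_a, "Lap Trinh Vien"), phrase_matches(doc_b, "Lap Trinh Vien"))
```

 With this sketch the accented query matches only Doc A and the unaccented
 query matches both. A partially accented word such as "resumé" matches
 neither "résumé" nor its stripped form "resume", which is the limitation
 described above.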

 Peter

 Peter Binkley
 Digital Initiatives Technology Librarian
 Information Technology Services
 4-30 Cameron Library
 University of Alberta Libraries
 Edmonton, Alberta
 Canada T6G 2J8
 Phone: (780) 492-3743
 Fax: (780) 492-9243
 e-mail: [EMAIL PROTECTED]

 ~ The code is willing, but the data is weak. ~


 -Original Message-
 From: climbingrose [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 10, 2008 10:01 PM
 To: solr-user@lucene.apache.org
 Subject: Accented search

 Hi guys,

 I'm running into some problems with accented (UTF-8) languages. I'd love to
 hear some ideas about how to use Solr with those languages. Basically, I
 want to achieve what Google does with UTF-8 languages.

 My requirements include:
 1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters Lập Trình Viên, then Doc B is also matched and Lập
 Trình Viên is highlighted.
  On the other hand, if the query is Lap Trinh Vien, Doc A is also
 matched.
 2) Assign proper scores to accented or non-accented searches:
  if the user enters Lập Trình Viên, then Doc A should be given higher
 score than DOC B.
  if the query is Lap Trinh Vien, Doc A should be given higher score.

 Any ideas guys? Thanks in advance!

 --
 Regards,

 Cuong Hoang




-- 
Regards,

Cuong Hoang


Cannot start solr

2008-03-11 Thread Vinci

I followed the tutorial on the wiki, but when I go to
http://server_address/solr/admin
I get a Tomcat error message:
HTTP 404

Then I checked the Tomcat manager and saw that the application is not
started. When I attempt to start it, I get this error message:

FAIL - Application at context path /solr could not be started

I am using Tomcat 5.5 on Debian, and I am placing the war file outside
/webapps; I also copied everything under /example/solr to the path I pointed
to... I checked that the files are there.

What did I do wrong?
-- 
View this message in context: 
http://www.nabble.com/Cannot-start-solr-tp15997140p15997140.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Cannot start solr

2008-03-11 Thread Vinci

Additional Information:

2008/3/12 上午 11:10:54 org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: Using JNDI solr.home: /var/webapps/solr
2008/3/12 上午 11:10:54 org.apache.solr.servlet.SolrDispatchFilter init
INFO: looking for multicore.xml: /var/webapps/solr/multicore.xml
2008/3/12 上午 11:10:54 org.apache.solr.servlet.SolrDispatchFilter init
FATAL: Could not start SOLR. Check solr/home property
java.lang.ExceptionInInitializerError
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:104)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302)
    at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:78)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4222)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
    at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
    at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
    at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1206)
    at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:293)
    at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
    at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1306)
    at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1570)
    at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1579)
    at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1559)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: XPathFactory#newInstance() failed to
create an XPathFactory for the default object model:
http://java.sun.com/jaxp/xpath/dom with the
XPathFactoryConfigurationException:
javax.xml.xpath.XPathFactoryConfigurati...

2008/3/12 上午 11:10:54 org.apache.catalina.core.StandardContext start
FATAL: Error filterStart
2008/3/12 上午 11:10:54 org.apache.catalina.core.StandardContext start
FATAL: Context [/solr] startup failed due to previous errors


-
Related config:
Solr home is located in /var/webapps/solr

tree:
/var/webapps/solr/
|-- README.txt
|-- bin
|   |-- abc
|   |-- abo
|   |-- backup
|   |-- backupcleaner
|   |-- commit
|   |-- optimize
|   |-- readercycle
|   |-- rsyncd-disable
|   |-- rsyncd-enable
|   |-- rsyncd-start
|   |-- rsyncd-stop
|   |-- scripts-util
|   |-- snapcleaner
|   |-- snapinstaller
|   |-- snappuller
|   |-- snappuller-disable
|   |-- snappuller-enable
|   `-- snapshooter
`-- conf
    |-- admin-extra.html
    |-- elevate.xml
    |-- protwords.txt
    |-- schema.xml
    |-- scripts.conf
    |-- solrconfig.xml
    |-- stopwords.txt
    |-- synonyms.txt
    `-- xslt
        |-- example.xsl
        |-- example_atom.xsl
        |-- example_rss.xsl
        `-- luke.xsl

solr.xml:
<Context docBase="/usr/lib/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/var/webapps/solr" override="true" />
</Context>

Can anybody help me? I am not so familiar with tomcat...



-- 
View this message in context: 
http://www.nabble.com/Cannot-start-solr-tp15997140p15997330.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: schema help

2008-03-11 Thread Otis Gospodnetic
Geoff, some comments inlined.

- Original Message 
From: Geoffrey Young [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, March 11, 2008 4:55:15 PM
Subject: Re: schema help



Otis Gospodnetic wrote:
 Geoff,
 
 I'm not sure if I understood your problem correctly, but it sounds
 like you want your search to be restricted to authors, but then you
 want to list all of his/her books when displaying results. 

that's about right.  add that I may also want to search on libraries and 
show all the books (and authors) stored there.

OG: That's fine.  One page (of results) at a time, I imagine.

in real life, it's not books or authors, of course, but the parallels 
are close enough :)  in fact, the library example is a good one for 
me... or at least a network of public libraries linked together.

 The
 easiest thing to do would be to create an index where each
 row/Document has the author name, the book title, etc.  For each
 author-matching Document you'd pull his/her books out of the result
 set.  Yes, this means the author name would be denormalized in
 RDBMS-speak.  

I think I can live with the denormalization - it seems lucene is flat 
and very different conceptually than a database :)

OG: Right, it is. :)

the trouble I'm having is one of dimension.  an author has many, many 
attributes (name, birthdate, biography in $language, etc).  as does each 
book (title in $language, summary in $language, genre, etc).  as does 
each library (name, address, directions in $language, etc).  so an 
author with N books doesn't seem to scale very well in the flat 
representations I'm finding in all the lucene/solr docs and examples... 
at least not in some way I can wrap my head around.

OG: I'm not sure why the number of attributes worries you.  Imagine it as a 
wide RDBMS table, if it helps.  Indices with dozens of fields are not uncommon.
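
To make the "wide table" picture concrete, here is a small Python sketch of
the denormalized layout: one flat document per book-in-library, with the
author attributes repeated on every row, and the author / book / library
display rebuilt by grouping the hits. The field names and sample data are
made up for illustration, and the substring filter merely stands in for a
real Solr query.

```python
from collections import defaultdict

# Hypothetical flat documents: author fields are repeated (denormalized)
# on every book row, exactly like a wide RDBMS table.
docs = [
    {"id": "1", "author_name": "Jane Doe", "book_title": "First Book",
     "library_name": "Central"},
    {"id": "2", "author_name": "Jane Doe", "book_title": "Second Book",
     "library_name": "Eastside"},
    {"id": "3", "author_name": "John Roe", "book_title": "Other Book",
     "library_name": "Central"},
]


def search_by_author(all_docs, name_fragment):
    """Stand-in for a search restricted to the author field."""
    needle = name_fragment.lower()
    return [d for d in all_docs if needle in d["author_name"].lower()]


def group_for_display(hits):
    """Rebuild the author -> [(book, library), ...] view from flat hits."""
    grouped = defaultdict(list)
    for doc in hits:
        grouped[doc["author_name"]].append(
            (doc["book_title"], doc["library_name"]))
    return dict(grouped)


result = group_for_display(search_by_author(docs, "doe"))
for author, books in result.items():
    print(author)
    for title, library in books:
        print("  %s in %s" % (title, library))
```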

part of what seemed really appealing about lucene in general was that 
you could stuff all this (unindexed) information into a document and 
retrieve it all based on some search criteria.  but it's seeming very 
difficult for me to wrap my head around the data I need to represent.

OG: You certainly can do that.  I'm not sure I understand where the hard part 
is.  You seem to know what attributes each entity has.  Maybe you are confused 
by how to handle N different types of entities in a single index? (I'm assuming 
a single index is what you currently have in mind)

 Another option is not to index/store book titles, but
 rather have only an author index to search against.  The book data
 (mapped to author identities) would then be pulled from an external
 source (e.g. RDBMS: select title from books where author_id in
 (1,2,3)) at search results display time.

eew :)  seriously, though, that's what we have now - all rdbms driven. 
if solr could only conceptually handle the initial lookup there wouldn't 
be much point.

OG: Well, there might or might not be, depending on how much data you have, how 
flexible and fast your RDBMS-powered (full-text?) search is, and so on.  The 
Lucene/Solr for full-text search + RDBMS/BDB for display data is a common 
combination.
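
A minimal sketch of that combination, assuming the search step has already
returned a list of matching author ids: the display data is then fetched
from the database with a parameterized version of the "where author_id in
(...)" query quoted above. The table layout, sample rows and the function
name titles_for_authors are hypothetical; an in-memory SQLite database stands
in for the real RDBMS.

```python
import sqlite3

# In-memory database standing in for the real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO books (author_id, title) VALUES (?, ?)",
    [(1, "A"), (1, "B"), (2, "C"), (4, "D")])


def titles_for_authors(connection, author_ids):
    """Fetch book titles for the given author ids, grouped by author id."""
    if not author_ids:
        return {}
    # One "?" placeholder per id keeps the query parameterized.
    placeholders = ", ".join("?" for _ in author_ids)
    sql = ("SELECT author_id, title FROM books "
           "WHERE author_id IN (%s) ORDER BY author_id, id" % placeholders)
    grouped = {}
    for author_id, title in connection.execute(sql, list(author_ids)):
        grouped.setdefault(author_id, []).append(title)
    return grouped


# Pretend these ids came back from the author-only search index.
matched_author_ids = [1, 2, 3]
titles = titles_for_authors(conn, matched_author_ids)
print(titles)
```

Authors without any books (id 3 here) simply do not appear in the result, so
the display code has to allow for that.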

maybe I'm thinking about this all wrong (as is to be expected :), but I 
just can't believe that nobody is using solr to represent data a bit 
more complex than the examples out there.

OG: Oh, lots of people are, it's just that examples are simple, so people new 
to Solr, Lucene, etc. have an easier time learning.

Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 
 Otis
 
 -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
 From: Geoffrey Young [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Tuesday, March 11, 2008 12:17:32 PM
 Subject: schema help
 
 hi :)
 
 I'm trying to work out a schema for our widgets.  more than just
 coming up with something I'd like something idiomatic in solr terms.
 any help is much appreciated.  here's a similar problem space to what
 I'm working with...
 
 let's say we're talking books.  books are written by authors and held
 in libraries.  a sister company is using lucene+compass and they seem
 to have completely different collections (or whatever the technical
 term is :)
 
 authors
 books
 libraries
 
 so that a search for authors hits only the authors dataset.
 
 all of the solr examples I can find don't seem to address this kind
 of data disparity.  what is the standard and idiomatic approach for
 solr?
 
 for my particular data I'd want to display something like this
 
 author
   book in library
   book in library
 
 on the same result page, but using a completely flat, single schema 
 doesn't seem to scale very well.
 
 collective wisdom most welcome :)
 
 --Geoff
 
 





Solr nightly build and the multicore mode

2008-03-11 Thread Vinci

Hi all,

After tracing the log, I found that the Tomcat problem with the nightly build
is caused by multicore.xml: if multicore.xml doesn't exist, Tomcat won't run
the application the way Jetty does (Jetty runs in single-core mode if the
file doesn't exist).

Q1. I don't know how to set the path... WHERE should I put the core1 and
core0 folders? Somewhere in solr/home, or somewhere in webapps? And how do I
get the admin panel working?

Q2. How can I disable the multicore function when multicore.xml exists? Just
remove the second core?

Thank you for any reply
-- 
View this message in context: 
http://www.nabble.com/Solr-nightly-build-and-the-multicore-mode-tp15997822p15997822.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Result based sorting for KWIC?

2008-03-11 Thread Christian Wittern

Chris Hostetter wrote:
 1) if you've got full text search, why would you even want KWIC?

Well, KWIC is a way to present the full text search results so that they 
can be easily read. 

 2) your description of how you'd want the results ordered is extremely 
 confusing to me ... can you give a simple concrete example of some 
 documents / queries / result-doclists that you would want to see?

If you go to http://tkb.mydns.jp:8899/exist/rest/db/new/tkb.xq you will 
see what I currently have.  Just click search to search for the example, 
or maybe delete the last character so that you get more results (this is 
not released yet, so don't be surprised if it breaks...). 
You will see the search term highlighted in the middle, context is 
available from the blue arrow to the right.  The display would be much 
more useful for the users, if this could be sorted on the characters 
following the hit (ignoring punctuation).  Another option would be to 
sort on the characters previous to the hit.  But in this case, the 
sorting has to be reversed, so that if I have:

 ABCDhitFGHI
the sort-key would be constructed as DCBA for this case.

I know that this can be done by post-processing the results on the 
client (which is what Erik suggested offline), but if I get thousands of 
hits, that would be very slow, so I am looking for other ways.   Erik 
also said that down the road there might be a sort function that could 
be called, which is what I would need here. 
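
For reference, the two sort keys described above could be built roughly as
follows. This is only a client-side Python sketch of the key construction,
so it does not address the concern about thousands of hits; the function
names and sample hits are hypothetical, and "punctuation" is assumed to mean
any character in a Unicode punctuation category.

```python
import unicodedata


def drop_punctuation(text):
    """Remove every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P"))


def right_key(right_context):
    """Sort key for the characters following the hit."""
    return drop_punctuation(right_context)


def left_key(left_context):
    """Sort key for the characters preceding the hit, read backwards."""
    return drop_punctuation(left_context)[::-1]


# Each hit is (left context, match, right context).
hits = [
    ("ABCD", "hit", "FGHI"),
    ("XYZA", "hit", ",EFG"),
    ("QRSB", "hit", "FGAA"),
]

by_following = sorted(hits, key=lambda h: right_key(h[2]))
by_preceding = sorted(hits, key=lambda h: left_key(h[0]))

print(left_key("ABCD"))
print([h[2] for h in by_following])
print([h[0] for h in by_preceding])
```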


Cheers,

Christian

--
Christian Wittern 
Institute for Research in Humanities, Kyoto University

47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN