RE: lucene 4.0 release date

2010-11-07 Thread Uwe Schindler
You have to also use Solr 4.0 :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Li Li [mailto:fancye...@gmail.com]
> Sent: Monday, November 08, 2010 8:47 AM
> To: dev@lucene.apache.org; simon.willna...@gmail.com
> Subject: Re: lucene 4.0 release date
> 
> thank you.
> so if I want to use new compress/decompress algorithm, I must use
lucene
> 4.0 in svn? Is there any patch for old release such as 2.9?because I need
solr 1.4
> which based on lucene 2.9
> 
> 2010/11/8 Simon Willnauer :
> > Li Li,
> >
> > there is no official / unofficial release date for lucene 4.0 if you
> > want to use the latest and greatest features you need to checkout
> > trunk of use a nightly build. My guess would be that there is at least
> > 6 to 8 month to the next release but I can be wrong (more likely it
> > might take even longer) .
> >
> > For PFoR etc you should look into:
> > https://issues.apache.org/jira/browse/LUCENE-1410
> > https://issues.apache.org/jira/browse/LUCENE-2723
> >
> > to get started - and read mikes blog
> > http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-c
> > odec.html
> >
> > There is also S9
> > https://issues.apache.org/jira/browse/LUCENE-2189
> > and GroupVInt impls
> > https://issues.apache.org/jira/browse/LUCENE-2735
> >
> > simon
> >
> > On Mon, Nov 8, 2010 at 4:59 AM, Li Li  wrote:
> >> hi all,
> >>    When will lucene 4.0 be released?
> >>    I want to replace VInt compression with fast ones such as
> >> PForDelta. In my application, decompressing a docList of 10M will use
> >> about 300ms. In  "Performance of Compressed Inverted List Caching in
> >> Search Engines. With J. Zhang and X.Long. 17th International World
> >> Wide Web Conference (WWW), April 2008." , the author says PForDelta
> >> is much faster than VInt. And I also found a java implementation in
> >> http://code.google.com/p/integer-array-compress-kit/ it's speed is
> >> 500 (M int / sec). But to achieve, I have to modify index file
> >> format. And I found
> >> http://wiki.apache.org/lucene-java/FlexibleIndexing  in lucene 4.0,
> >> it will support fore flexible index format. I want to know when it
> >> will be released so as to decide whether wait it or doing it myself.
Thank
> you.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For
> >> additional commands, e-mail: dev-h...@lucene.apache.org
> >>
> >>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For
> > additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
> commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: lucene 4.0 release date

2010-11-07 Thread Li Li
thank you.
so if I want to use new compress/decompress algorithm, I must use
lucene 4.0 in svn? Is there any patch for old release such as
2.9?because I need solr 1.4 which based on lucene 2.9

2010/11/8 Simon Willnauer :
> Li Li,
>
> there is no official / unofficial release date for lucene 4.0 if you
> want to use the latest and greatest features you need to checkout
> trunk of use a nightly build. My guess would be that there is at least
> 6 to 8 month to the next release but I can be wrong (more likely it
> might take even longer) .
>
> For PFoR etc you should look into:
> https://issues.apache.org/jira/browse/LUCENE-1410
> https://issues.apache.org/jira/browse/LUCENE-2723
>
> to get started - and read mikes blog
> http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
>
> There is also S9
> https://issues.apache.org/jira/browse/LUCENE-2189
> and GroupVInt impls
> https://issues.apache.org/jira/browse/LUCENE-2735
>
> simon
>
> On Mon, Nov 8, 2010 at 4:59 AM, Li Li  wrote:
>> hi all,
>>    When will lucene 4.0 be released?
>>    I want to replace VInt compression with fast ones such as
>> PForDelta. In my application, decompressing a docList of 10M will use
>> about 300ms. In  "Performance of Compressed Inverted List Caching in
>> Search Engines. With J. Zhang and X.Long. 17th International World
>> Wide Web Conference (WWW), April 2008." , the author says PForDelta is
>> much faster than VInt. And I also found a java implementation in
>> http://code.google.com/p/integer-array-compress-kit/ it's speed is 500
>> (M int / sec). But to achieve, I have to modify index file format. And
>> I found http://wiki.apache.org/lucene-java/FlexibleIndexing  in lucene
>> 4.0, it will support fore flexible index format. I want to know when
>> it will be released so as to decide whether wait it or doing it
>> myself. Thank you.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: lucene 4.0 release date

2010-11-07 Thread Simon Willnauer
Li Li,

there is no official / unofficial release date for lucene 4.0 if you
want to use the latest and greatest features you need to checkout
trunk of use a nightly build. My guess would be that there is at least
6 to 8 month to the next release but I can be wrong (more likely it
might take even longer) .

For PFoR etc you should look into:
https://issues.apache.org/jira/browse/LUCENE-1410
https://issues.apache.org/jira/browse/LUCENE-2723

to get started - and read mikes blog
http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html

There is also S9
https://issues.apache.org/jira/browse/LUCENE-2189
and GroupVInt impls
https://issues.apache.org/jira/browse/LUCENE-2735

simon

On Mon, Nov 8, 2010 at 4:59 AM, Li Li  wrote:
> hi all,
>    When will lucene 4.0 be released?
>    I want to replace VInt compression with fast ones such as
> PForDelta. In my application, decompressing a docList of 10M will use
> about 300ms. In  "Performance of Compressed Inverted List Caching in
> Search Engines. With J. Zhang and X.Long. 17th International World
> Wide Web Conference (WWW), April 2008." , the author says PForDelta is
> much faster than VInt. And I also found a java implementation in
> http://code.google.com/p/integer-array-compress-kit/ it's speed is 500
> (M int / sec). But to achieve, I have to modify index file format. And
> I found http://wiki.apache.org/lucene-java/FlexibleIndexing  in lucene
> 4.0, it will support fore flexible index format. I want to know when
> it will be released so as to decide whether wait it or doing it
> myself. Thank you.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Antw.: Solr-3.x - Build # 160 - Failure

2010-11-07 Thread Uwe Schindler
No updates on the Hudson issue until now. What should we do? Disable Clover 
report generation for now?

I have no idea, what else we could do.

Uwe

---
Uwe Schindler
Generics Policeman
Bremen, Germany

- Reply message -
Von: "Apache Hudson Server" 
Datum: Mo., Nov. 8, 2010 06:55
Betreff: Solr-3.x - Build # 160 - Failure
An: 

Build: https://hudson.apache.org/hudson/job/Solr-3.x/160/

All tests passed

Build Log (for compile errors):
[...truncated 18776 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




Solr-3.x - Build # 160 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-3.x/160/

All tests passed

Build Log (for compile errors):
[...truncated 18776 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



lucene 4.0 release date

2010-11-07 Thread Li Li
hi all,
When will lucene 4.0 be released?
I want to replace VInt compression with fast ones such as
PForDelta. In my application, decompressing a docList of 10M will use
about 300ms. In  "Performance of Compressed Inverted List Caching in
Search Engines. With J. Zhang and X.Long. 17th International World
Wide Web Conference (WWW), April 2008." , the author says PForDelta is
much faster than VInt. And I also found a java implementation in
http://code.google.com/p/integer-array-compress-kit/ it's speed is 500
(M int / sec). But to achieve, I have to modify index file format. And
I found http://wiki.apache.org/lucene-java/FlexibleIndexing  in lucene
4.0, it will support fore flexible index format. I want to know when
it will be released so as to decide whether wait it or doing it
myself. Thank you.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-trunk - Build # 1356 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1356/

All tests passed

Build Log (for compile errors):
[...truncated 18288 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (SOLR-2221) DIH: use StrUtils.parseBool() to get values of boolean options of import command

2010-11-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved SOLR-2221.
--

Resolution: Fixed

trunk: Committed revision 1032446.
branch_3x: Committed revision 1032451.

> DIH: use StrUtils.parseBool() to get values of boolean options of import 
> command
> 
>
> Key: SOLR-2221
> URL: https://issues.apache.org/jira/browse/SOLR-2221
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 1.3, 1.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-2221.patch
>
>
> Currently, debug option for full-import/delta-import accepts only "on" for 
> true:
> {code}
> if ("on".equals(requestParams.get("debug"))) {
>   debug = true;
>   rows = 10;
>   // Set default values suitable for debug mode
>   commit = false;
>   clean = false;
>   verbose = "true".equals(requestParams.get("verbose"))
>   || "on".equals(requestParams.get("verbose"));
> }
> {code}
> and other boolean options uses Boolean.parseBoolean(String). We would like to 
> use StrUtils.parseBool() that accepts true, on and yes for true, and false, 
> off and no for false.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Assigned: (SOLR-2221) DIH: use StrUtils.parseBool() to get values of boolean options of import command

2010-11-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned SOLR-2221:


Assignee: Koji Sekiguchi

> DIH: use StrUtils.parseBool() to get values of boolean options of import 
> command
> 
>
> Key: SOLR-2221
> URL: https://issues.apache.org/jira/browse/SOLR-2221
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 1.3, 1.4
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Trivial
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-2221.patch
>
>
> Currently, debug option for full-import/delta-import accepts only "on" for 
> true:
> {code}
> if ("on".equals(requestParams.get("debug"))) {
>   debug = true;
>   rows = 10;
>   // Set default values suitable for debug mode
>   commit = false;
>   clean = false;
>   verbose = "true".equals(requestParams.get("verbose"))
>   || "on".equals(requestParams.get("verbose"));
> }
> {code}
> and other boolean options uses Boolean.parseBoolean(String). We would like to 
> use StrUtils.parseBool() that accepts true, on and yes for true, and false, 
> off and no for false.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Here's an uncleaned up cut with all tests passing. I nulled out
the lastSegmentInfo on abort which fixes the my own assertion
that was causing the rollback tests to not pass. I don't know if
this is cheating or not yet just to get the tests to pass.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kickoff.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though, less so in flex since opening the terms index is much
> faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (SOLR-1973) Empty fields in update messages confuse DataImportHandler

2010-11-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved SOLR-1973.
--

Resolution: Fixed

trunk: Committed revision 1032433.
branch_3x: Committed revision 1032438.

> Empty fields in update messages confuse DataImportHandler
> -
>
> Key: SOLR-1973
> URL: https://issues.apache.org/jira/browse/SOLR-1973
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4, 1.4.1
> Environment: CentOS 5, Java 1.6, Tomcat 6
>Reporter: Sixten Otto
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1973-test.patch, SOLR-1973.patch, SOLR-1973.patch
>
>
> I seem to be running into an issue with Solr (maybe just the 
> DataImportHandler?) not liking empty field elements in the docs, and getting 
> the wrong values into the fields of the index. Here's the entity declaration 
> from data-config.xml for my isolated example:
>  
>dataSource="xml"
> processor="XPathEntityProcessor"
> stream="true"
> url="http://example.com/Content.xml";
> useSolrAddSchema="true">
>   
>  
> And here's the Content.xml being pulled in by the DIH:
>  
>   
> 
> Lorem Ipsum Dolor
> Some content is me!
>   
>  
> And here's the relevant portion of the output from the DIH in debug mode:
>  
>   
> http://example.com/Content.xml
>   
>   0:0:0.6
>   --- row #1-
>   Some content is me!
>   Lorem Ipsum Dolor
>   -
>  
> Notice that the field "full" doesn't appear here, but the following field 
> "empty" has the content that was there for "full". The "other" field, which 
> was non-empty, and preceded by a non-empty field, shows up correctly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929424#action_12929424
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

In DW abort (called by IW rollbackInternal) we should be able to simply clear 
all per segment pending deletes, however, I'm not sure we can do that, in fact, 
if we have applied deletes for a merge, then we rollback, we can't undo those 
deletes thereby breaking our current rollback model?

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kickoff.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though, less so in flex since opening the terms index is much
> faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2029) Support for Index Time Document Boost in SolrContentHandler

2010-11-07 Thread Jayendra Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayendra Patil updated SOLR-2029:
-

Description: 
We are using the extract request handler to index rich content documents with 
other metadata.
However, SolrContentHandler does not seem to support the parameter for applying 
index time document boost. 
Basically, including document.setDocumentBoost(boost).

  was:
We are using the extract request handler to index rich content documents with 
other metadata.
However, SolrContentHandler does seem to support the parameter for applying 
index time document boost. 
Basically, including document.setDocumentBoost(boost).


> Support for Index Time Document Boost in SolrContentHandler
> ---
>
> Key: SOLR-2029
> URL: https://issues.apache.org/jira/browse/SOLR-2029
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4.1
>Reporter: Jayendra Patil
> Attachments: SolrContentHandler.patch
>
>
> We are using the extract request handler to index rich content documents with 
> other metadata.
> However, SolrContentHandler does not seem to support the parameter for 
> applying index time document boost. 
> Basically, including document.setDocumentBoost(boost).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Everything passes, except for tests that involve IW rollback.  We need to be 
able to rollback the last segment info/index in DW, however I'm not sure how we 
want to do that quite yet.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kickoff.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though, less so in flex since opening the terms index is much
> faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1107 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1107/

3 tests failed.
FAILED:  
junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64

Error Message:
Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.store.RAMFile.newBuffer(RAMFile.java:89)
at org.apache.lucene.store.RAMFile.addBuffer(RAMFile.java:62)
at 
org.apache.lucene.store.RAMOutputStream.switchCurrentBuffer(RAMOutputStream.java:132)
at 
org.apache.lucene.store.RAMOutputStream.copyBytes(RAMOutputStream.java:171)
at 
org.apache.lucene.store.MockIndexOutputWrapper.copyBytes(MockIndexOutputWrapper.java:134)
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:222)
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:188)
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:212)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4182)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3659)
at 
org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2566)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2387)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2316)
at 
org.apache.lucene.index.RandomIndexWriter.close(RandomIndexWriter.java:145)
at 
org.apache.lucene.search.TestNumericRangeQuery64.beforeClass(TestNumericRangeQuery64.java:92)


FAILED:  
junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64

Error Message:
MockDirectoryWrapper: cannot close: there are still open files: {_3f.pst=1}

Stack Trace:
java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 
open files: {_3f.pst=1}
at 
org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:343)
at 
org.apache.lucene.search.TestNumericRangeQuery64.afterClass(TestNumericRangeQuery64.java:101)
Caused by: java.lang.RuntimeException: unclosed IndexInput
at 
org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:300)
at 
org.apache.lucene.index.codecs.simpletext.SimpleTextFieldsReader.(SimpleTextFieldsReader.java:57)
at 
org.apache.lucene.index.codecs.simpletext.SimpleTextCodec.fieldsProducer(SimpleTextCodec.java:53)
at 
org.apache.lucene.index.SegmentReader$CoreReaders.(SegmentReader.java:136)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:536)
at 
org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:626)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4114)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3659)
at 
org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2566)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2387)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2316)
at 
org.apache.lucene.index.RandomIndexWriter.close(RandomIndexWriter.java:145)
at 
org.apache.lucene.search.TestNumericRangeQuery64.beforeClass(TestNumericRangeQuery64.java:92)


FAILED:  
junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64

Error Message:
directory of test was not closed, opened from: 
org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653)

Stack Trace:
junit.framework.AssertionFailedError: directory of test was not closed, opened 
from: 
org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653)
at 
org.apache.lucene.util.LuceneTestCase.afterClassLuceneTestCaseJ4(LuceneTestCase.java:331)




Build Log (for compile errors):
[...truncated 3113 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2029) Support for Index Time Document Boost in SolrContentHandler

2010-11-07 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929421#action_12929421
 ] 

Lance Norskog commented on SOLR-2029:
-

+1

Nice!

> Support for Index Time Document Boost in SolrContentHandler
> ---
>
> Key: SOLR-2029
> URL: https://issues.apache.org/jira/browse/SOLR-2029
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4.1
>Reporter: Jayendra Patil
> Attachments: SolrContentHandler.patch
>
>
> We are using the extract request handler to index rich content documents with 
> other metadata.
> However, SolrContentHandler does seem to support the parameter for applying 
> index time document boost. 
> Basically, including document.setDocumentBoost(boost).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Rethinking spatial implementation

2010-11-07 Thread Grant Ingersoll

On Nov 6, 2010, at 5:23 PM, Christopher Schmidt wrote:

> Hi Ryan, thx for your answer.
> 
> You mean there is room for improvement and volunteers?

We've been looking at replacing it with the Military Grid system.  The primary 
issue with the current is that the Sinusoidal projection is broken which then 
breaks almost all the tests.  I worked on it for a while trying to straighten 
it out, but gave up and now think it is easier to implement clean.  I 
definitely would like to see a tier/grid implementation.


> 
> On Friday, November 5, 2010, Ryan McKinley  wrote:
>> Hi Christopher -
>> 
>> I do not believe there is any active work on this.  From what I
>> understand, the Tier implementation works OK within some constraints,
>> but we could not get it to pass more robust testing that the other
>> methods were using.
>> 
>> However, LatLonType and GeoHashField are well tested and work well --
>> the Tier type may have better performance when your index is really
>> large, but no active developers understand it and no-one has stepped
>> up to figure it out.
>> 
>> ryan
>> 
>> 
>> On Wed, Nov 3, 2010 at 3:16 PM, Christopher Schmidt
>>  wrote:
>>> Hi all,
>>> I saw a mail thread "Rethinking Cartesian Tiers implementation" (here).
>>> Is there any work in progress regarding this? If yes, is the current
>>> implementation deprecated or do you plan some enhancements (other
>>> projections or spatial indexes) ?
>>> I am asking because I want to use Lucene's spatial indexing in a production
>>> system...
>>> 
>>> --
>>> Christopher
>>> twitter: @fakod
>>> blog: http://blog.fakod.eu
>>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> 
>> 
> 
> -- 
> Christopher
> twitter: @fakod
> blog: http://blog.fakod.eu
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929408#action_12929408
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. Why don't we teach StandardTokenizer to produce tokens for separator chars?

I've been thinking about this - the word break rules in UAX#29 are intended for 
use in break iterators, and tokenizers take that one step further by discarding 
stuff between some breaks.

StandardTokenizer is faster, though, since it doesn't have to tokenize the 
stuff between tokens, so if we go down this route, I think it should go 
somewhere else: UAX29WordBreakSegmenter or something like that.

I'd like to have (nestable) SentenceSegmenter, ParagraphSegmenter, etc., the 
output from which could be the input to tokenizers.

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929407#action_12929407
 ] 

Earwin Burrfoot commented on LUCENE-2167:
-

bq. Would it somehow be possible to allow multiple Tokenizers to work together?
The fact that Tokenizers now are not TokenFilters bugs me somewhat.
In theory, you should just feed initial text as a single monster token from 
hell into analysis chain, and then you only have TokenFilters, none/one/some of 
which might split this token.
If there are no TokenFilters at all, you get a NOT_ANALYZED case without extra 
flags, yahoo!

The only problem here is the need for ability to wrap arbitrary Reader in a 
TermAttribute :/

bq. But (yay repetition!) if the tokenizer throws away the separator chars, 
URLs can't be reassembled from their parts.
Why don't we teach StandardTokenizer to produce tokens for separator chars?
A special filter at the end of the chain will drop them, so they won't get into 
index.
And in the midst of the filter chain you are free to do whatever you want with 
them - detect emails/urls/sentences/whatever.

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929405#action_12929405
 ] 

Steven Rowe commented on LUCENE-2745:
-

{quote}
one solution, add a charfilter that maps zwnj to space for persiananalyzer?

this way, it could use uax29 and support numerics etc
{quote}

I like it - it sounds better than my other idea: a configurable token splitting 
filter.

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

I placed (for now) the segment deletes directly into the segment info object.  
There's applied term/queries sets which are checked against when apply deletes 
all is called.  All tests pass except for TestTransactions and 
TestPersistentSnapshotDeletionPolicy only because of an assertion check I 
added, that the last segment info is in fact in the newly pushed segment infos. 
 I think in both cases segment infos is being altered in IW in a place where 
the segment infos isn't being pushed, yet.  I wanted to checkpoint this though 
as it's a fairly well working at this point, including the last segment 
info/index, which is can be turned on or off via a static variable.  

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kickoff.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though, less so in flex since opening the terms index is much
> faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929403#action_12929403
 ] 

Robert Muir commented on LUCENE-2745:
-

yes, sorry for the confusion!

one solution, add a charfilter that maps zwnj to space for persiananalyzer?

this way, it could use uax29 and support numerics etc

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1101 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1101/

1 tests failed.
REGRESSION:  org.apache.solr.TestDistributedSearch.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 8745 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929400#action_12929400
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. Hunh

Okay, I think I get it now.  

I did a search for U+200C in the whole Lucene project, and I found 
TestPersianAnalyzer.

Apparently, Robert, when you said "the whole analyzer" and "this approach" you 
meant PersianAnalyzer, rather than ArabicAnalyzer.  Sorry for the confusion.

What do you think the approach should be for Persian?  Maybe a 
StandardTokenizer clone that excludes ZWNJ from the \p{Word_Break:Extend} class 
that gets added to every rule?  I'll see if there is some way to compose a 
PersianTokenizer.jflex (using the %include directive maybe?) using 
StandardTokenizerImpl.jflex, so that we don't end up with code duplication.

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component

2010-11-07 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929401#action_12929401
 ] 

Toke Eskildsen commented on SOLR-792:
-

The current interface does not allow for nested queries. It is my understanding 
that this limits the functionality to conventional hierarchical faceting with 
the slight twist that the counts are for the current level instead of current 
level + sub levels, but that should be attainable with conventional 
hierarchical faceting too. This makes current pivot faceting a sub-set of 
SOLR-64, provided that SOLR-64 is adjusted to accept a list of fields as 
building blocks instead of expressing the hierarchy in a single field with 
delimiters. This is a good thing. It means that it can be done fast and 
memory-efficient as well as sharing most of the interface and output format 
with SOLR-64.

Now, if something like nested queries is introduced in the pivot faceting 
interface, this changes the requirements of the underlying code as a complete 
recount is needed for each level. One evil nested query could be "Select the 
documents where field X contains the last letter of the current tag plus the 
first letter of the original query". This makes it hard (I try and avoid using 
the word "impossible") to create an implementation without query-explosion.

So where am I going with all this? My point is that the interface (of course) 
dictates how responsive the implementation can be. Focusing on interfaces and 
using small-scale test data does carry a risk of ending up with something that 
is inherently slow. It might be unfeasible to attain high scalability with a 
given interface addition and that is okay - as long as that cost is known and 
accepted. Hence my questions about scale and my musings about how to do it 
faster.

> Pivot (ie: Decision Tree) Faceting Component
> 
>
> Key: SOLR-792
> URL: https://issues.apache.org/jira/browse/SOLR-792
> Project: Solr
>  Issue Type: New Feature
>Reporter: Erik Hatcher
>Assignee: Yonik Seeley
>Priority: Minor
> Attachments: SOLR-792-as-helper-class.patch, 
> SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, 
> SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, 
> SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, 
> SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch
>
>
> A component to do multi-level faceting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929396#action_12929396
 ] 

Steven Rowe commented on LUCENE-2745:
-

{quote}
this is how the whole analyzer works, more examples in
the tests... I can give you more refs later, when I have
better bandwidth... but its specific to this language.
we shouldn't split on it in general... also often a real
space is used instead, so this approach is the simplest
for the language
{quote}

AFAICT, ArabicLetterTokenizer just adds non-spacing marks to the list of 
acceptable token characters, so they won't be used to split words.  However, 
ZWNJ (U+200C) has the "Cf" -- Format -- general category, *not* the "Mn" 
general category (non-spacing marks), so as far as I can tell, the current 
Lucene ArabicLetterTokenizer (and hence ArabicAnalyzer) splits on ZWNJ.

None of the tests in TestArabicLetterTokenizer nor in TestArabicAnalyzer 
contain ZWNJ (U+200C).

Maybe what I'm not understanding is "this approach" in your quote above.  Can 
you describe "this approach"?

When you wrote "we split on this and the affixes are in the stoplist" did you 
mean that ArabicLetterTokenizer *intentionally* breaks Persian words at ZWNJ?  
And then throws away the affixes that result?  Hunh


> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2029) Support for Index Time Document Boost in SolrContentHandler

2010-11-07 Thread Jayendra Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayendra Patil updated SOLR-2029:
-

Attachment: SolrContentHandler.patch

Attached is the Fix Patch.
The parameter name to be passed is boost.

> Support for Index Time Document Boost in SolrContentHandler
> ---
>
> Key: SOLR-2029
> URL: https://issues.apache.org/jira/browse/SOLR-2029
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4.1
>Reporter: Jayendra Patil
> Attachments: SolrContentHandler.patch
>
>
> We are using the extract request handler to index rich content documents with 
> other metadata.
> However, SolrContentHandler does seem to support the parameter for applying 
> index time document boost. 
> Basically, including document.setDocumentBoost(boost).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1098 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1098/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize

Error Message:
null

Stack Trace:
junit.framework.AssertionFailedError: 
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:127)
at 
org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:147)




Build Log (for compile errors):
[...truncated 3084 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929392#action_12929392
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. steven, check out the link at the bottom of that article.

Yup, did that.

bq. especially the top... it explains the use in the language, particularly to 
block cursive joining for prefixes, suffixes, compounds. we split on this and 
the affixes are in the stoplist 

Um, like I said, Persian uses ZWNJs as display hints, not as word separators.

According to the [ICU web 
demo|http://demo.icu-project.org/icu-bin/ubrowse?go=200C], ZWNJs have the 
\p{Word_Break:Extend} property, so the Lucene UAX#29-based tokenizers will 
*not* split on this char.

What am I not getting?

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929389#action_12929389
 ] 

Robert Muir commented on LUCENE-2745:
-

steven, check out the link at the bottom of that article.
especially the top... it explains the use in the language,
particularly to block cursive joining for prefixes, suffixes,
compounds. we split on this and the affixes are in the stoplist

this is how the whole analyzer works, more examples in
the tests... I can give you more refs later, when I have
better bandwidth... but its specific to this language.
we shouldn't split on it in general... also often a real
space is used instead, so this approach is the simplest
for the language

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929386#action_12929386
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. the only trick to deprecating this ArabicLetterTokenizer is the persian 
case, since i dont think UAX#29 will split on zero-width-non-joiner, so we need 
to do something to handle that case, otherwise we can default to a better 
tokenizer here.

Robert, can you provide more detail?  AFAICT from [this Wikipedia 
article|http://en.wikipedia.org/wiki/Zero-width_non-joiner], ZWNJs are used in 
Persian as display hints, not as word separators.

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929385#action_12929385
 ] 

Robert Muir commented on LUCENE-2167:
-

{quote}
You've convinced me, though I don't think this idea has been around long enough 
to qualify as intiutive.
{quote}

Well obviously i dont have hard references to this stuff, but from my 
interaction with my own users, most of them
dont even think of double quotes as doing phrases, nor are they technical 
enough to even know what a phrase
is or what that means for a search... they just think of it as more exact.

{quote}
I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative 
that provides the same thing. So we would have UAX#29 tokenizer as default; a 
UAX29+EMAIL+HOSTNAME tokenizer as the equivalent to the pre-3.1 
StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current StandardTokenizer). 
Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that 
provides a configurable feature to not output URLs, but instead HOSTNAMEs and 
URL component tokens?
{quote}

Well, like i said, i'm not particularly picky, especially since someone can 
always use ClassicTokenizer to get the old behavior,
which, no one could ever agree on and there was constantly issues about not 
recognizing my company's name etc etc.

To some extent, i like UAX#29 because there's someone else making and 
standardizing the decisions and validating
its not gonna annoy users of major languages, and making sure it works well by 
default: like its not gonna be the most 
full-featured tokenizer but theres little chance it will be really annoying: i 
think this is great for "defaults".

as for all the other "bonus" stuff we can always make options, especially if 
its some pluggable thing somehow (sorry not sure about how this could work in 
jflex)
where you could have options as to what you want to do.

but again, i think UAX#29 itself is more than sufficient by default, and even 
hostname etc is pretty dangerous *by default* 
(again my example of searching partial hostnames being flexible to the end-user 
and not baked-in, by letting them using quotes).


> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929382#action_12929382
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. So i think its just intuitive and becoming rather universal to put quotes 
around things to get a "more exact search".

You've convinced me, though I don't think this idea has been around long enough 
to qualify as intiutive.

bq. hostnames are just an example, why do we recognize them and not filenames?

Although following precedent is important (principle of least surprise), we 
have to be able to revisit these decisions.  My philosophy tends toward 
kitchen-sinkness, while allowing people to ignore the stuff they don't want 
(today).  So, yeah, I think we *should* (be able to) recognize filenames, at 
least as part of a URL-decomposing filter:

{noformat}http://www.example.com/path/file%20name.html?param=value#fragment{noformat}
=> 
{noformat}http://www.example.com/path/file%20name.html?param=value#fragment{noformat}
 
www.example.com 
example.com 
example 
com 
path 
file name.html 
file name 
file 
name 
html 
param 
value 
fragment 

Output of each token type could be optional in a URL decomposition filter.  The 
URL decomposition filter could serve as a place to handle punycode, too.

bq. i'm not too picky how we solve the problem, but i think UAX#29 is a great 
default... its used everywhere else...

I think if we remove EMAIL/HOSTNAME recognition, we need to have an alternative 
that provides the same thing.  So we would have UAX#29 tokenizer as default; a 
UAX29+EMAIL+HOSTNAME tokenizer as the equivalent to the pre-3.1 
StandardTokenizer; and a UAX29+URL+EMAIL tokenizer (current StandardTokenizer). 
 Or maybe the last two could be combined: a UAX29+URL+EMAIL tokenizer that 
provides a configurable feature to not output URLs, but instead HOSTNAMEs and 
URL component tokens?

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929380#action_12929380
 ] 

Robert Muir commented on LUCENE-2167:
-

bq. "www.facebook.com" is way non-intuitive

well, i'm just saying that the "UAX#29" behavior i describe, people are used to:
* google and twitter search engines find and highlight say 'cnn' in urls such 
as 'http:/www.cnn.com/x/y'
* this is how "find" in apps such as browsers, word processors, even windows 
notepad work.
* the idea of putting quotes around things to be "more exact" is pretty 
general, e.g. in google i refine queries like "documents" with quotes to 
prevent stemming: try it.

So i think its just intuitive and becoming rather universal to put quotes 
around things to get a "more exact search".

Like i said, i'm not too picky how we solve the problem, but i think UAX#29 is 
a great default... its used everywhere else...

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929379#action_12929379
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. if you recognize www.facebook.com but my app wants to find this with a 
query of 'facebook', it cant. yet if just stick to uax#29, if a user queries on 
www.facebook.com, and they are unsatisfied with the results, that  

"www.facebook.com" is way non-intuitive.  My guess is the average user would 
never go there: how is something a phrase, and in need of bounding quotes, if 
it has no spaces in it?

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929377#action_12929377
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. Would it somehow be possible to allow multiple Tokenizers to work together? 

The only thing I can think of right now is a new kind of component that feeds 
raw text (or post-char-filter text) to a configurable set of 
tokenizers/recognizers, then melds their results using some (hopefully 
configurable) strategy, like "longest-match-wins" or 
"create-overlapping-tokens", etc.  This would slow things down, of course, 
since analysis has to be performed multiple times over the same chunk of input 
text...

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929376#action_12929376
 ] 

Robert Muir commented on LUCENE-2167:
-

bq. So we're talking about two separate issues here: a) Lucene's default 
behavior; and b) Lucene's capabilities.

agreed!

bq. For a), you'll have a lot of 'splaining to do if you drop existing 
functionality (e.g. email and hostname "recognition" - where quotes indicate 
"bad" things, right? "Cool"!)

to me recognizing hostnames is specific to what one application might want.
if you recognize www.facebook.com but my app wants to find this with a query of 
'facebook', it cant.
yet if just stick to uax#29, if a user queries on www.facebook.com, and they 
are unsatisfied with the results,
that user can always "refine" their query by searching on "www.facebook.com" 
and they get a phrasequery.
I think this is pretty intuitive and users are used to this... again this is 
just for general defaults...

and again, hostnames are just an example, why do we recognize them and not 
filenames?
yet a lot of people are happy being able to do 'partial filename' matching and 
not the whole path...
users that are unhappy with this 'default' behavior can use double quotes to 
refine their results.

and in both cases, apps that need something more specific can use a custom 
tokenizer.

bq.  Why not? Why shouldn't Lucene be a catch-all for "cool" linguistic stuff?

In this case I think analysis won't meet their needs anyway. a lot of people 
wanting to recognize full urls or proper names (mike's example)
actually want to do this in the 'document build' and dump the extracted 
entities into a separate field, so they can do things like
facet on this field, or find other documents that refer to the same person. 
This is because they are trying to 'find structure in the unstructured',
but it starts to get complicated if we mix this problem with 'feature 
extraction' which is what i think analysis should be.





> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929372#action_12929372
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. I just tend to really like plain old "uax#29" as a default [...] i would 
prefer if we tried to keep the complexity down

So we're talking about two separate issues here: a) Lucene's default behavior; 
and b) Lucene's capabilities. 

For a), you'll have a lot of 'splaining to do if you drop existing 
functionality (e.g. email and hostname "recognition" -- where quotes indicate 
"bad" things, right? "Cool"!)

For b), you appear to agree with Marvin Humphries about keeping the product 
lean and mean: complexity (a.k.a. functionality beyond the default) is bad 
because it creates maintenance problems.

bq. we should try to not make analysis the "wonder-do-it-all" machine.

Why not?  Why shouldn't Lucene be a catch-*all* for "cool" linguistic stuff?


> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929370#action_12929370
 ] 

Robert Muir commented on LUCENE-2167:
-

{quote}
But if, say, we had a Tokenizer that recognizes hostnames/URLs, one that 
recognizes email addresses, one for proper names/places/date/time, other app 
dependent stuff like detecting part numbers and what not, then ideally one 
could simply cascade/compose these tokenizers at will to build up whatever 
"initial" tokenizer you need for you chain?

I think our current lack of composability of the initial tokenizer ("there can 
be only one") makes cases like this hard...
{quote}

I agree that sounds like a "cool" idea to have, but at the same time, we should 
try to not make analysis the "wonder-do-it-all" machine.
I mean some stuff belongs in the app, and i think that includes a lot of things 
you mentioned... e.g. the app can do "NER" and pull
out proper names/places/dates and put them in separate fields.

I don't think the analysis chain is the easiest or best place to do this, i 
would prefer if we tried to keep the complexity down and recognize
that some things (really a lot of this "recognizer" stuff) might be better 
implemented in the app.




> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929369#action_12929369
 ] 

Robert Muir commented on LUCENE-2167:
-

bq. UAX29Tokenizer does not have email or hostname recognition. 
StandardTokenizer has long had these capabilities (though not standard-based). 
Removing them would be bad.

Thats true, so maybe something in the "middle" / "compromise" is better as a 
default.

I just tend to really like plain old "uax#29" as a default, since its 
consistent with how "tokenization" works elsewhere in people's wordprocessors, 
browsers, etc
(e.g. control-F find, that sort of thing), where they dont know specifics of 
content and want to just have a reasonable default.

but there might be something else we can do, too.


> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929368#action_12929368
 ] 

Michael McCandless commented on LUCENE-2167:


Would it somehow be possible to allow multiple Tokenizers to work together?

Today we only allow one (and then any number of subsequent TokenFilters) in the 
chain, so if your Tokenizer destroys information (eg erases the . from the host 
name) it's hard for subsequent TokenFilters to put them back.

But if, say, we had a Tokenizer that recognizes hostnames/URLs, one that 
recognizes email addresses, one for proper names/places/date/time, other app 
dependent stuff like detecting part numbers and what not, then ideally one 
could simply cascade/compose these tokenizers at will to build up whatever 
"initial" tokenizer you need for you chain?

I think our current lack of composability of the initial tokenizer ("there can 
be only one") makes cases like this hard...

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929367#action_12929367
 ] 

M Alexander commented on LUCENE-2745:
-

Yes Robert, I have faced the diacritics problem. I am trying to have an 
Analyzer that would not break on diacritics as well as recognising email 
addresses, hostnames and so on (which Arabic text may contain). This is why I 
asked the question to see if there is a way to have full Arabic analysis 
(including diacritics) along with recognising email addresses, hostnames, etc 
at the same Analyzer. I will try your suggestions and will share the output. 
Thanks Robert for your help

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929366#action_12929366
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. i would prefer we switch the situation around: make UAX#29 
'standardtokenizer' and give the uax#29+url+email+ip+... a different name.

UAX29Tokenizer does not have email or hostname recognition.  StandardTokenizer 
has long had these capabilities (though not standard-based).  Removing them 
would be bad.

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929365#action_12929365
 ] 

Robert Muir commented on LUCENE-2745:
-

I agree with what Steven said here... since previously StandardTokenizer would 
break on diacritics (shadda, etc)
it wasn't appropriate for arabic writing systems, so we added 
ArabicLetterTokenizer as a workaround.

but you can use a different tokenizer in your own Analyzer to meet your 
needs... and we should try to avoid 
(deprecate+remove) language-specific tokenizers if we can.

the only trick to deprecating this ArabicLetterTokenizer is the persian case, 
since i dont think UAX#29 will split on
zero-width-non-joiner, so we need to do something to handle that case, 
otherwise we can default to a better tokenizer here.


> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929364#action_12929364
 ] 

M Alexander commented on LUCENE-2745:
-

Thanks Steven. I will give it a go and will share the outcome.

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929363#action_12929363
 ] 

Robert Muir commented on LUCENE-2167:
-

bq.  because when people want full URLs, they can't be reassembled after the 
separator chars are thrown away by the tokenizer.

Well, i dont much like this argument, because its true about anything.
Indexing text for search is a lossy thing by definition.

yeah, when you tokenize this stuff, you lose paragraphs, sentences, all kinds 
of things.
should we output whole paragraphs as tokens so its not lost? 

bq. Robert, when I mentioned the decomposition filter, you said you didn't like 
that idea. Do you still feel the same?

Well, i said it was a can of worms, i still feel that it is complicated, yes.
But i mean we do have a ghetto decomposition filter (WordDelimiterFilter) 
already.
And someone can use this with the UAX#29+URLRecognizingTokenizer to index these 
urls in a variety of ways, including preserving the original full url too.

bq. Would a URL decomposition filter, with full URL emission turned off by 
default, work here?

It works in theory, but its confusing that its 'required' to not get absymal 
tokens.
i would prefer we switch the situation around: make UAX#29 'standardtokenizer' 
and give the uax#29+url+email+ip+... a different name.

because to me, uax#29 handles urls in nice ways, e.g. my user types 'facebook' 
and they get back facebook.com
its certainly simple and won't blow up terms dictionaries...

otherwise, creating lots of long, unique tokens (urls) by default is a serious 
performance trap, particularly for lucene 3.x



> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929362#action_12929362
 ] 

Steven Rowe commented on LUCENE-2745:
-

I think that ArabicLetterTokenizer, which is the tokenizer used by 
ArabicAnalyzer, is obsolete (as of version 3.1), since StandardTokenizer, which 
implements the Unicode word segmentation rules from UAX#29, should be able to 
properly tokenize Arabic.  StandardTokenizer recognizes email addresses, 
hostnames, and URLs, so your concern would be addressed.  (See LUCENE-2167, 
though, which was just reopened to turn off full URL output.)

You can test this by composing your own analyzer, if you're willing to try 
using using as-yet-unreleased branch_3X, from which 3.1 will be cut (hopefully 
fairly soon): just copy ArabicAnalyzer class and swap in StandardTokenizer for 
ArabicLetterTokenizer

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> --
>
> Key: LUCENE-2745
> URL: https://issues.apache.org/jira/browse/LUCENE-2745
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
> Environment: All
>Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer can tokenises this to 
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929361#action_12929361
 ] 

Steven Rowe commented on LUCENE-2167:
-

bq. I don't think StandardTokenizer should produce whole URLs as tokens, to 
begin with.

I think Standard *Analyzer* should not by default produce whole URLs as tokens. 
 But (yay repetition!) if the tokenizer throws away the separator chars, URLs 
can't be reassembled from their parts.

Would a URL decomposition filter, with full URL emission turned off by default, 
work here?


> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929359#action_12929359
 ] 

Steven Rowe commented on LUCENE-2167:
-

{quote}
I think that full urls as tokens is not a good default for StandardTokenizer, 
because i don't think users ever search
on full URLS.
{quote}

Probably true, but this is a chicken and egg issue, no?  Maybe people never 
search on full URLs because it doesn't work, because there is no tokenization 
support for it?

My preferred solution here, as I [said earlier in this issue|#action_12865759], 
is to use a decomposing filter, because when people want full URLs, they can't 
be reassembled after the separator chars are thrown away by the tokenizer.

Robert, when I mentioned the decomposition filter, you [said|#action_12865879] 
you didn't like that idea.  Do you still feel the same?

I'm really reluctant to drop the ability to recognize full URLs.  I agree, 
though, that as a default it's not good.


> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene-Solr-tests-only-trunk - Build # 1071 - Still Failing

2010-11-07 Thread Michael McCandless
Sadly, yes, this is SimpleText's "speedup", heh.

However, the OOM actually occurs when the RIW used by TestNRQ64 calls
optimize on .close().  But, this optimize is unused since the test has
already pulled its reader.  I think a simple workaround would be to
close the RIW but never let it call optimize?  Eg we could just reach
in and close the underlying writer (writer.w.close)... or add a
.closeNoOptimize to RIW?

Mike

On Sat, Nov 6, 2010 at 6:07 PM, Uwe Schindler  wrote:
> Hehe, yes!
>
> This test creates lots of terms:
> As far as I remember, it indexes 10,000 documents in 9 different number 
> styles and precisionSteps (2, 4, 6, 8):
>
> But when RANDOM_MULTIPLIER > 1 then much more:
>  private static final int noDocs = 1 * RANDOM_MULTIPLIER;
>
> As the lowerPrec terms are duplicates, the approx. number of terms (I assume 
> that for next shift value, the number of terms reduces by 2 because of 
> overlaps):
> precStep=8: 10 000 + 5 000 + 2 500 + 1 250 + 625 + 312 + 161 + 80 = 19 928 
> terms
> precStep=6: 10 000 + 5 000 + 2 500 + 1 250 + 625 + 312 + 161 + 80 + 40 + 20 + 
> 10 = 19 998 terms
> precStep=4: some more (16 summands) :-], approx. 20 500
> precStep=2: again more (32 summands), approx. 20800
>
> This makes approx 9*20,000*RANDOM_MULTIPLIER (its 3 on Hudson)  terms in the 
> index -- ahm TreeSet incl TermInfo *g*. For TestNumericRangeQuery32 its 
> similar, but fewer precsteps.
>
> We have several options: Run all tests with default Heap size only for 
> SimpleText and check that all tests pass, if not raise -Xmx until it passes 
> (currently tests use 512 M).
>
> Alternatively reduce the index size for some tests if SimpleText codec is 
> used. TestNumeric should be easy possible, as the test uses a some 
> preconfigured constants for building index and run tests.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Robert Muir [mailto:rcm...@gmail.com]
>> Sent: Saturday, November 06, 2010 9:03 PM
>> To: dev@lucene.apache.org
>> Subject: Re: Lucene-Solr-tests-only-trunk - Build # 1071 - Still Failing
>>
>> On Sat, Nov 6, 2010 at 3:58 PM, Apache Hudson Server
>>  wrote:
>> > Error Message:
>> > Java heap space
>> >
>> ...
>> > Error Message:
>> > MockDirectoryWrapper: cannot close: there are still open files:
>> > {_2h.pst=1}
>>
>> seems like its likely caused by the simpletext optimization?
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
>> commands, e-mail: dev-h...@lucene.apache.org
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929327#action_12929327
 ] 

Michael McCandless commented on LUCENE-2167:


+1

When I indexed Wikipedia w/ StandardAnalyzer I saw a huge number of full-url 
tokens, which is just silly as a default.  Inserting WordDelimiterFilter fixed 
it, but, I don't think StandardTokenizer should produce whole URLs as tokens, 
to begin with.

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-11-07 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reopened LUCENE-2167:
-


I'd like to re-open this issue.

I think that full urls as tokens is not a good default for StandardTokenizer, 
because i don't think users ever search
on full URLS. its also dangerous, many apps that upgrade will find themselves 
with huge terms dictionaries, 
and different performance characteristics.

i think it would be better if standardtokenizer just implemented the uax#29 
algorithm. the url identification we could
keep as a separate tokenizer for people that want that.

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1, 4.0
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 1094 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1094/

1 tests failed.
REGRESSION:  org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple

Error Message:
expected:<3> but was:<2>

Stack Trace:
junit.framework.AssertionFailedError: expected:<3> but was:<2>
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
at 
org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple(TestLBHttpSolrServer.java:126)




Build Log (for compile errors):
[...truncated 8690 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Solr-trunk - Build # 1305 - Still Failing

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1305/

All tests passed

Build Log (for compile errors):
[...truncated 18440 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 1077 - Failure

2010-11-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/1077/

1 tests failed.
REGRESSION:  org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety

Error Message:
unable to create new native thread

Stack Trace:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:614)
at 
org.apache.lucene.search.TestThreadSafe.doTest(TestThreadSafe.java:133)
at 
org.apache.lucene.search.TestThreadSafe.testLazyLoadThreadSafety(TestThreadSafe.java:152)
at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:253)




Build Log (for compile errors):
[...truncated 8819 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org