from:"Koji Sekiguchi \(Updated\) \(JIRA\)"

[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

2012-03-26 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3888:
---

Fix Version/s: (was: 3.6)

Thanks, Robert, for the patches and comments.

{quote}
The only option for 3.6 would be something like my previous patch
(https://issues.apache.org/jira/secure/attachment/12519860/LUCENE-3888.patch),
which has the disadvantages of doing the second-phase re-ranking on surface forms.
{quote}

With those disadvantages, the spell checker won't work well for Japanese anyway.
I'm giving up on this for 3.6.

 split off the spell check word and surface form in spell check dictionary
 -

 Key: LUCENE-3888
 URL: https://issues.apache.org/jira/browse/LUCENE-3888
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/spellchecker
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, 
 LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch


 The "did you mean?" feature built on Lucene's spell checker does not work well 
 for Japanese, and this is a longstanding problem. The logic needs 
 comparatively long text to check spelling, but in some languages 
 (e.g. Japanese) most words are too short for the spell checker to use.
 I think that, at least for Japanese, things can be improved if we split off 
 the spell check word and the surface form in the spell check dictionary. Then we 
 can use ReadingAttribute for spell checking but CharTermAttribute for 
 suggesting, for example.
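
The proposal above can be illustrated with a small sketch that matches on a reading but returns the surface form. It is plain Java and does not use Lucene. The class name ReadingSurfaceDictionary, the romanized readings, and the Levenshtein-based matching are all assumptions made for illustration; this is not the approach taken in the attached patches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: spell check against readings, suggest surface forms.
final class ReadingSurfaceDictionary {
    // reading (the spell check word) -> surface form (what is shown to the user)
    private final Map<String, String> surfaceByReading = new LinkedHashMap<>();

    void add(String reading, String surface) {
        surfaceByReading.put(reading, surface);
    }

    // Returns the surface forms whose reading is within maxEdits of queryReading.
    List<String> suggest(String queryReading, int maxEdits) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : surfaceByReading.entrySet()) {
            if (editDistance(queryReading, e.getKey()) <= maxEdits) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    // Plain Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        ReadingSurfaceDictionary dict = new ReadingSurfaceDictionary();
        dict.add("toukyou", "\u6771\u4EAC"); // Tokyo, two kanji as surface form
        dict.add("kyouto", "\u4EAC\u90FD");  // Kyoto, two kanji as surface form
        // The misspelled reading is long enough to match even though the
        // surface forms are only two characters each.
        System.out.println(dict.suggest("toukyo", 1));
    }
}
```

The point of the split is visible in the data: the surface forms are only two characters long, while the readings give the matcher more text to compare.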

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

2012-03-24 Thread Koji Sekiguchi (Updated) (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Koji Sekiguchi updated LUCENE-3888:
---

Attachment: LUCENE-3888.patch

I added a test for the surface analyzer. I also added code for the analyzer in
Solr.

Currently, due to a classpath problem, the test does not compile. I should dig
in, but if someone else could take a look, it would be appreciated.

split off the spell check word and surface form in spell check dictionary
-

Key: LUCENE-3888
URL: https://issues.apache.org/jira/browse/LUCENE-3888
Project: Lucene - Java
Issue Type: Improvement
Components: modules/spellchecker
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
Fix For: 3.6, 4.0

Attachments: LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch,
LUCENE-3888.patch



[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

2012-03-20 Thread Koji Sekiguchi (Updated) (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Koji Sekiguchi updated LUCENE-3888:
---

Attachment: LUCENE-3888.patch

The patch does not compile yet because I changed the return type of the
method in the Dictionary interface but have not yet updated the implementing
classes.

Please comment, because I'm new to the spell checker. If there is no problem
with this direction, I'll continue working on it.

split off the spell check word and surface form in spell check dictionary
-

Attachments: LUCENE-3888.patch


[jira] [Updated] (SOLR-2418) remove deprecated highlighting/ syntax

2012-03-19 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2418:
-

Fix Version/s: (was: 3.6)

 remove deprecated highlighting/ syntax
 

 Key: SOLR-2418
 URL: https://issues.apache.org/jira/browse/SOLR-2418
 Project: Solr
  Issue Type: Task
  Components: highlighter
Affects Versions: 1.4.1, 3.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 4.0


 excerpt from CHANGES.txt:
 {noformat}
 ==  3.1.0-dev ==
   snip
 Upgrading from Solr 1.4
 --
   snip
 * Old syntax of highlighting configuration in solrconfig.xml
   is deprecated (SOLR-1696)
 {noformat}


[jira] [Updated] (SOLR-2202) Money/Currency FieldType

2012-03-10 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2202:
-

Attachment: SOLR-2202-fix-NPE-if-no-tlong-fieldType.patch

A draft patch which hasn't been tested yet.

 Money/Currency FieldType
 

 Key: SOLR-2202
 URL: https://issues.apache.org/jira/browse/SOLR-2202
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Greg Fodor
Assignee: Jan Høydahl
 Fix For: 3.6, 4.0

 Attachments: SOLR-2022-solr-3.patch, 
 SOLR-2202-fix-NPE-if-no-tlong-fieldType.patch, SOLR-2202-lucene-1.patch, 
 SOLR-2202-solr-1.patch, SOLR-2202-solr-10.patch, SOLR-2202-solr-2.patch, 
 SOLR-2202-solr-4.patch, SOLR-2202-solr-5.patch, SOLR-2202-solr-6.patch, 
 SOLR-2202-solr-7.patch, SOLR-2202-solr-8.patch, SOLR-2202-solr-9.patch, 
 SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, 
 SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch


 Provides support for monetary values to Solr/Lucene with query-time currency 
 conversion. The following features are supported:
 - Point queries
 - Range queries
 - Sorting
 - Currency parsing by either currency code or symbol.
 - Symmetric & Asymmetric exchange rates. (Asymmetric exchange rates are 
 useful if there are fees associated with exchanging the currency.)
 At indexing time, money fields can be indexed in a native currency. For 
 example, if a product on an e-commerce site is listed in Euros, indexing the 
 price field as 1000,EUR will index it appropriately. By altering the 
 currency.xml file, the sorting and querying against Solr can take into 
 account fluctuations in currency exchange rates without having to re-index 
 the documents.
 The new money field type is a polyfield which indexes two fields, one which 
 contains the amount of the value and another which contains the currency code 
 or symbol. The currency metadata (names, symbols, codes, and exchange rates) 
 is expected to be in an XML file which is pointed to by the field type 
 declaration in the schema.xml.
 The current patch is factored such that Money utility functions and 
 configuration metadata lie in Lucene (see MoneyUtil and CurrencyConfig), 
 while the MoneyType and MoneyValueSource lie in Solr. This was meant to 
 mirror the work being done on the spatial field types.
 This patch will be getting used to power the international search 
 capabilities of the search engine at Etsy.
 Also see WIKI page: http://wiki.apache.org/solr/MoneyFieldType
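
The query-time conversion described above might look roughly like the following sketch. It is plain Java and is not taken from the patch. The class MoneySketch, its methods, and the rate values are hypothetical, and double is used only to keep the example short (a real implementation would more likely store amounts as integer minor units). Asymmetric rates are modelled by storing each direction separately.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of query-time currency conversion with asymmetric rates.
final class MoneySketch {
    // Rate table keyed by "FROM->TO"; each direction is stored on its own,
    // so EUR->USD does not have to be the inverse of USD->EUR.
    private final Map<String, Double> rates = new HashMap<>();

    void setRate(String from, String to, double rate) {
        rates.put(from + "->" + to, rate);
    }

    // Parses a value such as "1000,EUR" and converts it to the target currency.
    double convert(String value, String target) {
        String[] parts = value.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected amount,CODE but got: " + value);
        }
        double amount = Double.parseDouble(parts[0].trim());
        String code = parts[1].trim();
        if (code.equals(target)) {
            return amount; // already in the requested currency
        }
        Double rate = rates.get(code + "->" + target);
        if (rate == null) {
            throw new IllegalArgumentException("no rate for " + code + "->" + target);
        }
        return amount * rate;
    }

    public static void main(String[] args) {
        MoneySketch money = new MoneySketch();
        money.setRate("EUR", "USD", 1.25); // made-up rate
        money.setRate("USD", "EUR", 0.75); // made-up rate, deliberately not 1 / 1.25
        System.out.println(money.convert("1000,EUR", "USD"));
        System.out.println(money.convert("1000,USD", "EUR"));
    }
}
```

Because the rates live outside the indexed values, changing them changes the results of sorting and querying without re-indexing, which is the property the description relies on.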


[jira] [Updated] (SOLR-2909) Add support for ResourceLoaderAware tokenizerFactories in synonym filter factories

2012-02-20 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2909:
-

Attachment: SOLR-2909.patch

 Add support for ResourceLoaderAware tokenizerFactories in synonym filter 
 factories
 --

 Key: SOLR-2909
 URL: https://issues.apache.org/jira/browse/SOLR-2909
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.4, 4.0
Reporter: Tom Klonikowski
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2909.patch


 The optional custom tokenizerFactory in SlowSynonymFilterFactory and 
 FSTSynonymFilterFactory might require the ResourceLoader information. Thus 
 inform(ResourceLoader) should be called if the specified tokenizerFactory is 
 an instance of ResourceLoaderAware.
 {noformat}
 private static TokenizerFactory loadTokenizerFactory(ResourceLoader loader,
     String cname, Map<String, String> args) {
   TokenizerFactory tokFactory = (TokenizerFactory) loader.newInstance(cname);
   tokFactory.init(args);
   if (tokFactory instanceof ResourceLoaderAware) {
     ((ResourceLoaderAware) tokFactory).inform(loader);
   }
   return tokFactory;
 }
 {noformat}


[jira] [Updated] (SOLR-3055) Use NGramPhraseQuery in Solr

2012-01-30 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-3055:
-

Attachment: SOLR-3055.patch

How about introducing something like GramSizeAttribute?

I attached a draft-level patch that just illustrates the idea.

 Use NGramPhraseQuery in Solr
 

 Key: SOLR-3055
 URL: https://issues.apache.org/jira/browse/SOLR-3055
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis, search
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-3055.patch


 Solr should use NGramPhraseQuery when searching with the default slop on an 
 n-gram field.
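
For context, the optimization behind NGramPhraseQuery can be summarized, with some hedging, as follows: when a phrase consists of consecutive overlapping n-grams, many of the grams are redundant, and it is enough to keep every n-th gram plus the last one. The sketch below is plain Java, does not call Lucene, and uses hypothetical names (NGramPhraseSketch, requiredPositions); it only shows which gram positions would remain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the term-pruning idea, independent of Lucene.
final class NGramPhraseSketch {
    // Splits text into overlapping n-grams, e.g. "ABCD" with n=2 -> AB, BC, CD.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Positions of the grams that still have to be matched: every n-th gram
    // and the final gram together cover the whole original text.
    static List<Integer> requiredPositions(int gramCount, int n) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < gramCount; i++) {
            if (i % n == 0 || i == gramCount - 1) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<String> grams = ngrams("ABCDE", 2);
        for (int pos : requiredPositions(grams.size(), 2)) {
            System.out.println(pos + ": " + grams.get(pos));
        }
    }
}
```

The pruning needs the gram size n, which is presumably why the comment above asks about something like a GramSizeAttribute to carry it from the analyzer to the query.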


[jira] [Updated] (LUCENE-3698) FastVectorHighlighter adds a multi value separator (space) to the end of the highlighted text

2012-01-16 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3698:
---

Attachment: LUCENE-3698.patch

Updated patch. As I think this fix changes runtime behavior, I added an entry 
to CHANGES.txt. I also changed Field.isTokenized() to 
Field.fieldType().tokenized() because the former does not compile on trunk.

 FastVectorHighlighter adds a multi value separator (space) to the end of the 
 highlighted text
 -

 Key: LUCENE-3698
 URL: https://issues.apache.org/jira/browse/LUCENE-3698
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Reporter: Shay Banon
 Attachments: LUCENE-3698.patch, LUCENE-3698.patch


 The FVH adds an additional ' ' (the multi value separator) to the end of the 
 highlighted text.


[jira] [Updated] (LUCENE-3698) FastVectorHighlighter adds a multi value separator (space) to the end of the highlighted text

2012-01-16 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3698:
---

 Priority: Minor  (was: Major)
Fix Version/s: 4.0
   3.6
 Assignee: Koji Sekiguchi

 FastVectorHighlighter adds a multi value separator (space) to the end of the 
 highlighted text
 -

 Key: LUCENE-3698
 URL: https://issues.apache.org/jira/browse/LUCENE-3698
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Reporter: Shay Banon
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3698.patch, LUCENE-3698.patch


 The FVH adds an additional ' ' (the multi value separator) to the end of the 
 highlighted text.


[jira] [Updated] (LUCENE-3697) FastVectorHighlighter SimpleBoundaryScanner does not work well when highlighting at the beginning of the text

2012-01-15 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3697:
---

Attachment: LUCENE-3697.patch

A patch that just fixes the expected string in #testMVSeparator.

 FastVectorHighlighter SimpleBoundaryScanner does not work well when 
 highlighting at the beginning of the text 
 --

 Key: LUCENE-3697
 URL: https://issues.apache.org/jira/browse/LUCENE-3697
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/highlighter
Reporter: Shay Banon
 Attachments: LUCENE-3697.patch, LUCENE-3697.patch


 The SimpleBoundaryScanner still breaks text at a position that is not based on 
 the provided boundary characters when the scan ends up reaching the beginning 
 of the text to highlight. In this case, just use the start of the text as the offset.
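
A minimal sketch of the behavior requested above: scan backwards from the start position for a boundary character and, if the scan reaches the beginning of the text, return offset 0. This is a standalone illustration in plain Java, not the code from the patch; the class BoundaryScanSketch and the parameter names are assumptions.

```java
import java.util.Set;

// Hypothetical standalone sketch of a backward boundary scan.
final class BoundaryScanSketch {
    // Returns the offset where a fragment should start, scanning at most
    // maxScan characters backwards from start.
    static int findStartOffset(CharSequence text, int start, int maxScan,
                               Set<Character> boundaryChars) {
        if (start > text.length() || start < 1) {
            return start; // nothing to scan
        }
        int offset = start;
        for (int count = maxScan; offset > 0 && count > 0; count--) {
            if (boundaryChars.contains(text.charAt(offset - 1))) {
                return offset; // first position after a boundary character
            }
            offset--;
        }
        if (offset == 0) {
            return 0; // reached the beginning of the text: treat it as a boundary
        }
        return start; // scan budget used up without finding a boundary
    }

    public static void main(String[] args) {
        Set<Character> boundaries = Set.of(' ', '.');
        System.out.println(findStartOffset("hello world", 3, 10, boundaries));
        System.out.println(findStartOffset("hello world", 8, 10, boundaries));
    }
}
```

Without the `offset == 0` check, a match near the start of the text would fall through to `return start` and the fragment would begin in the middle of a word.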


[jira] [Updated] (SOLR-3012) SimplePostTool: move getProperty(type) out of postData()

2012-01-09 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-3012:
-

Fix Version/s: 4.0
   3.6

 SimplePostTool: move getProperty(type) out of postData()
 

 Key: SOLR-3012
 URL: https://issues.apache.org/jira/browse/SOLR-3012
 Project: Solr
  Issue Type: Improvement
Reporter: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.6, 4.0

 Attachments: SOLR-3012.patch


 Applications that use SimplePostTool can currently set the Content-type, but 
 only through the "type" system property.
 {code}
 public void postData(InputStream data, Integer length, OutputStream output) {
 
   final String type = System.getProperty("type", DEFAULT_DATA_TYPE);
   :
 }
 {code}
 If the getProperty() call is moved to main() and the type can be passed as an 
 argument of the method, client applications become more flexible.
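
The proposed change could look roughly like this. The sketch is self-contained and does no HTTP: it only echoes the Content-type and the data to the output stream so that the shape of the API is visible. The class PostToolSketch, the simplified postData signature, and the DEFAULT_DATA_TYPE value are assumptions for illustration, not the code in the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the content type is a method argument, and only
// main() consults the "type" system property.
final class PostToolSketch {
    static final String DEFAULT_DATA_TYPE = "application/xml"; // assumed default

    // Library entry point: callers pass the content type explicitly.
    static void postData(InputStream data, String type, OutputStream output)
            throws IOException {
        // A real tool would set this as an HTTP request header; the sketch
        // writes it out as a line of text instead.
        output.write(("Content-type: " + type + "\n").getBytes(StandardCharsets.UTF_8));
        byte[] buf = new byte[1024];
        int n;
        while ((n = data.read(buf)) != -1) {
            output.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        // The command-line front end is the only place that reads the property.
        String type = System.getProperty("type", DEFAULT_DATA_TYPE);
        InputStream data = new ByteArrayInputStream("<add/>".getBytes(StandardCharsets.UTF_8));
        postData(data, type, System.out);
    }
}
```

With this shape, a client application can post CSV or JSON by passing the type directly, without mutating JVM-wide system properties.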


[jira] [Updated] (SOLR-3012) SimplePostTool: move getProperty(type) out of postData()

2012-01-09 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-3012:
-

Attachment: SOLR-3012.patch

New patch. As clients of SimplePostTool will need to be recompiled due to 
this change, I added a change note to this patch.

I'll commit tonight if nobody objects.

 SimplePostTool: move getProperty(type) out of postData()
 

 Key: SOLR-3012
 URL: https://issues.apache.org/jira/browse/SOLR-3012
 Project: Solr
  Issue Type: Improvement
Reporter: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.6, 4.0

 Attachments: SOLR-3012.patch, SOLR-3012.patch




[jira] [Updated] (SOLR-3012) SimplePostTool: move getProperty(type) out of postData()

2012-01-08 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-3012:
-

Attachment: SOLR-3012.patch

I also made the doGet methods static in this patch.

 SimplePostTool: move getProperty(type) out of postData()
 

 Key: SOLR-3012
 URL: https://issues.apache.org/jira/browse/SOLR-3012
 Project: Solr
  Issue Type: Improvement
Reporter: Koji Sekiguchi
Priority: Trivial
 Attachments: SOLR-3012.patch




[jira] [Updated] (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.

2011-12-27 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2346:
-

Attachment: SOLR-2346.patch

New patch attached. I updated it for current trunk and changed the 
getCharsetFromContentType() method to strip any unnecessary text after the 
charset value.

I think this is ready to go.

 Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1, 4.0
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Assignee: Koji Sekiguchi
Priority: Critical
 Fix For: 3.6, 4.0

 Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, 
 UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt


 I am able to successfully index/search non-English files (like Hebrew, 
 Japanese) that are encoded in UTF-8. However, when I tried to index data 
 that was encoded in a local encoding like Big5 for Japanese, I could not see 
 the desired results. The contents after indexing looked garbled for the Big5 
 encoded document when I searched for all indexed documents. When I index the 
 attached non-UTF-8 file, it is indexed as follows:
 <result name="response" numFound="1" start="0">
   <doc>
     <arr name="attr_content">
       <str>�� ��</str>
     </arr>
     <arr name="attr_content_encoding">
       <str>Big5</str>
     </arr>
     <arr name="attr_content_language">
       <str>zh</str>
     </arr>
     <arr name="attr_language">
       <str>zh</str>
     </arr>
     <arr name="attr_stream_size">
       <str>17</str>
     </arr>
     <arr name="content_type">
       <str>text/plain</str>
     </arr>
     <str name="id">doc2</str>
   </doc>
 </result>
 </response>
 You said the file is indexed as UTF-8; however, it seems that the non-UTF-8 
 file gets indexed in Big5 encoding.
 Here I tried fetching the indexed data as Big5 and converting it to UTF-8:
 String id = (String) resulDocument.getFirstValue("attr_content");
 byte[] bytearray = id.getBytes("Big5");
 String utf8String = new String(bytearray, "UTF-8");
 It does not give the expected results.
 When I index the UTF-8 file, it is indexed as follows:
 <doc>
   <arr name="attr_content">
     <str>マイ ネットワーク</str>
   </arr>
   <arr name="attr_content_encoding">
     <str>UTF-8</str>
   </arr>
   <arr name="attr_stream_content_type">
     <str>text/plain</str>
   </arr>
   <arr name="attr_stream_name">
     <str>sample_jap_unicode.txt</str>
   </arr>
   <arr name="attr_stream_size">
     <str>28</str>
   </arr>
   <arr name="attr_stream_source_info">
     <str>myfile</str>
   </arr>
   <arr name="content_type">
     <str>text/plain</str>
   </arr>
   <str name="id">doc2</str>
 </doc>
 So, I can index and search UTF-8 data.
 For reference, below is the discussion with Yonik.
 Please find attached the TXT files that I was using to index and search.
 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" \
   -F "myfile=@sample_jap_non_UTF-8"
 One problem is that you are giving big5 encoded text to Solr and saying that 
 it's UTF8.
 Here's one way to actually tell solr what the encoding of the text you are 
 sending is:
 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" \
   --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
 Now the problem appears that for some reason, this doesn't work...
 Could you open a JIRA issue and attach your two test files?
 -Yonik
 http://lucidimagination.com
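
The exchange above hinges on the server decoding the request body with the charset named in the Content-type header. The following sketch shows one way such a lookup could work. It is plain Java and is not the code from the attached patch: the class ContentTypeCharsetSketch, the method names, and the UTF-8 fallback are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch: pick the decoder from the Content-type header.
final class ContentTypeCharsetSketch {
    // Returns the charset parameter of a Content-type value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Drop optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Decodes the raw bytes with the declared charset, falling back to UTF-8
    // (an assumed default) when none is declared or the name is unusable.
    static String decode(byte[] body, String contentType) {
        String name = charsetFromContentType(contentType);
        Charset cs = StandardCharsets.UTF_8;
        try {
            if (name != null && Charset.isSupported(name)) {
                cs = Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: keep the fallback.
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/plain; charset=big5"));
        byte[] latin1 = { (byte) 0xE9 }; // e with acute accent in ISO-8859-1
        System.out.println(decode(latin1, "text/plain; charset=ISO-8859-1"));
    }
}
```

The decoding has to happen on the original bytes. Once text has been decoded with the wrong charset, re-encoding the resulting String (as in the getBytes/new String attempt in the description) generally cannot recover the original characters.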


[jira] [Updated] (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.

2011-12-27 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2346:
-

Attachment: SOLR-2346.patch

bq. getCharsetFromContentType() method to remove unnecessary strings after the 
charset value.

My mistake. This is not necessary. I should have added the --data-binary option to curl.

 Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1, 4.0
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Assignee: Koji Sekiguchi
Priority: Critical
 Fix For: 3.6, 4.0

 Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, 
 SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, 
 sample_jap_non_UTF-8.txt




[jira] [Updated] (LUCENE-3640) remove IndexSearcher.close

2011-12-11 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3640:
---

Fix Version/s: (was: 3.6)

remove 3.6 tag from Fix Version/s

 remove IndexSearcher.close
 --

 Key: LUCENE-3640
 URL: https://issues.apache.org/jira/browse/LUCENE-3640
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-3640.patch, LUCENE-3640.patch


 Now that IS is never heavy (since you have to pass in your own IR), 
 IS.close is truly a no-op... I think we should remove it.


[jira] [Updated] (SOLR-2922) Upgrade commons io and lang in Solr

2011-11-27 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2922:
-

Attachment: SOLR-2922.patch

 Upgrade commons io and lang in Solr
 ---

 Key: SOLR-2922
 URL: https://issues.apache.org/jira/browse/SOLR-2922
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.5, 4.0
Reporter: Koji Sekiguchi
Priority: Trivial
 Attachments: SOLR-2922.patch


 Upgrade commons-io and commons-lang in Solr.


[jira] [Updated] (LUCENE-2949) FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper

2011-11-14 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-2949:
---

Assignee: (was: Koji Sekiguchi)

 FastVectorHighlighter FieldTermStack could likely benefit from using 
 TermVectorMapper
 -

 Key: LUCENE-2949
 URL: https://issues.apache.org/jira/browse/LUCENE-2949
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0.3, 4.0
Reporter: Grant Ingersoll
Priority: Minor
  Labels: FastVectorHighlighter, Highlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-2949.patch


 Based on my reading of the FieldTermStack constructor that loads the vector 
 from disk, we could probably save a bunch of time and memory by using the 
 TermVectorMapper callback mechanism instead of materializing the full array 
 of terms into memory and then throwing most of them out.
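
 The callback idea can be illustrated with a small, self-contained Java sketch. It does not use Lucene's TermVectorMapper: the TermCallback interface, the readTermVector method and the in-memory array standing in for the on-disk vector are all hypothetical, chosen only to show terms being filtered as they stream past instead of being collected first and discarded afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class TermCallbackSketch {
    private TermCallbackSketch() {}

    /** Hypothetical callback invoked once per stored term. */
    interface TermCallback {
        void onTerm(String term, int position);
    }

    /** Stand-in for a term vector reader: pushes each term to the callback. */
    static void readTermVector(String[] storedTerms, TermCallback callback) {
        for (int position = 0; position < storedTerms.length; position++) {
            callback.onTerm(storedTerms[position], position);
        }
    }

    /** Keeps only the positions of query terms; other terms are never stored. */
    static List<Integer> positionsOfQueryTerms(String[] storedTerms, Set<String> queryTerms) {
        List<Integer> positions = new ArrayList<>();
        readTermVector(storedTerms, (term, position) -> {
            if (queryTerms.contains(term)) {
                positions.add(position);
            }
        });
        return positions;
    }
}
```

 In this sketch the memory held by the caller grows with the number of matching terms rather than with the size of the whole vector, which is the saving the description anticipates.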


[jira] [Updated] (SOLR-1926) add hl.q parameter

2011-11-07 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1926:
-

Attachment: SOLR-1926.patch

This patch just removes the hl.text parameter.

 add hl.q parameter
 --

 Key: SOLR-1926
 URL: https://issues.apache.org/jira/browse/SOLR-1926
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.5, 4.0

 Attachments: SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch, 
 SOLR-1926.patch


 If the hl.q parameter is set, HighlightComponent uses it rather than q.
 Use case:
 You search for PC with highlighting and faceting enabled:
 {code}
 q=PC
 facet=on&facet.field=maker&facet.field=something
 hl=on&hl.fl=desc
 {code}
 You get a lot of results with snippets (the term PC highlighted in the desc field). 
 Then you click the link maker:DELL(50) to narrow the results:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc
 {code}
 You get narrowed results with the term PC highlighted in the snippets. But 
 sometimes I'd like DELL to be highlighted as well, because I clicked 
 DELL. In this case, hl.q can be used:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc&hl.q=PC+maker:DELL
 {code}
 Note that hl.requireFieldMatch should be false (the default) in this 
 scenario.
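
 As a rough illustration of the last request above, the following Java sketch assembles the same parameters into a URL-encoded query string. The class and method names are hypothetical and not part of Solr; only the parameter names and values come from the description.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper; it only builds a query string and sends nothing. */
final class HlqRequestSketch {
    private HlqRequestSketch() {}

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Joins the parameters as name=value pairs separated by '&'. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(enc(e.getKey()) + "=" + enc(e.getValue()));
        }
        return joiner.toString();
    }

    /** The narrowed search from the use case, with hl.q carrying both terms. */
    static Map<String, String> narrowedSearch() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "PC");
        params.put("facet", "on");
        params.put("facet.field", "something");
        params.put("fq", "maker:DELL");
        params.put("hl", "on");
        params.put("hl.fl", "desc");
        params.put("hl.q", "PC maker:DELL"); // highlight DELL as well as PC
        return params;
    }
}
```

 The search itself is still driven by q and fq; hl.q only changes which terms the highlighter marks.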


[jira] [Updated] (SOLR-1926) add hl.q parameter

2011-11-07 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1926:
-

Attachment: SOLR-1926.patch

Added localParams test for hl.q parameter.

I'll commit soon.

 add hl.q parameter
 --

 Key: SOLR-1926
 URL: https://issues.apache.org/jira/browse/SOLR-1926
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.5, 4.0

 Attachments: SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch, 
 SOLR-1926.patch, SOLR-1926.patch


 If the hl.q parameter is set, HighlightComponent uses it rather than q.
 Use case:
 You search for PC with highlighting and faceting enabled:
 {code}
 q=PC
 facet=on&facet.field=maker&facet.field=something
 hl=on&hl.fl=desc
 {code}
 You get a lot of results with snippets (the term PC highlighted in the desc field). 
 Then you click the link maker:DELL(50) to narrow the results:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc
 {code}
 You get narrowed results with the term PC highlighted in the snippets. But 
 sometimes I'd like DELL to be highlighted as well, because I clicked 
 DELL. In this case, hl.q can be used:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc&hl.q=PC+maker:DELL
 {code}
 Note that hl.requireFieldMatch should be false (the default) in this 
 scenario.


[jira] [Updated] (SOLR-1926) add hl.q parameter

2011-11-06 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1926:
-

Attachment: SOLR-1926.patch

This patch supports both hl.q and hl.text. The priority is:

# Highlighter checks whether hl.text is set and, if so, uses it. FVH doesn't 
check it, for performance reasons.
# If hl.text isn't set, hl.q is used.
# If hl.q isn't set, q is used.

localParams can be used in hl.q, and the hl.text parameter accepts per-field 
overrides.
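
The priority order above can be sketched as a small lookup. This is a minimal illustration, not the code in the patch: the class name, the method signature and the plain Map of request parameters are assumptions, and per-field overrides and localParams parsing are left out.

```java
import java.util.Map;

/** Hypothetical helper; not part of Solr. Shows the fallback order only. */
final class HighlightQueryResolver {
    private HighlightQueryResolver() {}

    /**
     * Returns the first non-empty value among hl.text, hl.q and q.
     * When fastVectorHighlighter is true, hl.text is skipped, mirroring
     * the note that FVH does not check it. Returns null if nothing is set.
     */
    static String resolve(Map<String, String> params, boolean fastVectorHighlighter) {
        String[] names = fastVectorHighlighter
                ? new String[] {"hl.q", "q"}
                : new String[] {"hl.text", "hl.q", "q"};
        for (String name : names) {
            String value = params.get(name);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return null;
    }
}
```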

 add hl.q parameter
 --

 Key: SOLR-1926
 URL: https://issues.apache.org/jira/browse/SOLR-1926
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Priority: Trivial
 Attachments: SOLR-1926.patch, SOLR-1926.patch


 If the hl.q parameter is set, HighlightComponent uses it rather than q.
 Use case:
 You search for PC with highlighting and faceting enabled:
 {code}
 q=PC
 facet=on&facet.field=maker&facet.field=something
 hl=on&hl.fl=desc
 {code}
 You get a lot of results with snippets (the term PC highlighted in the desc field). 
 Then you click the link maker:DELL(50) to narrow the results:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc
 {code}
 You get narrowed results with the term PC highlighted in the snippets. But 
 sometimes I'd like DELL to be highlighted as well, because I clicked 
 DELL. In this case, hl.q can be used:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc&hl.q=PC+maker:DELL
 {code}
 Note that hl.requireFieldMatch should be false (the default) in this 
 scenario.


[jira] [Updated] (SOLR-1926) add hl.q parameter

2011-11-06 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1926:
-

Fix Version/s: 4.0
   3.5

 add hl.q parameter
 --

 Key: SOLR-1926
 URL: https://issues.apache.org/jira/browse/SOLR-1926
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.5, 4.0

 Attachments: SOLR-1926.patch, SOLR-1926.patch


 If the hl.q parameter is set, HighlightComponent uses it rather than q.
 Use case:
 You search for PC with highlighting and faceting enabled:
 {code}
 q=PC
 facet=on&facet.field=maker&facet.field=something
 hl=on&hl.fl=desc
 {code}
 You get a lot of results with snippets (the term PC highlighted in the desc field). 
 Then you click the link maker:DELL(50) to narrow the results:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc
 {code}
 You get narrowed results with the term PC highlighted in the snippets. But 
 sometimes I'd like DELL to be highlighted as well, because I clicked 
 DELL. In this case, hl.q can be used:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc&hl.q=PC+maker:DELL
 {code}
 Note that hl.requireFieldMatch should be false (the default) in this 
 scenario.


[jira] [Updated] (SOLR-1926) add hl.q parameter

2011-11-06 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1926:
-

Attachment: SOLR-1926.patch

New patch. I added a test case.

 add hl.q parameter
 --

 Key: SOLR-1926
 URL: https://issues.apache.org/jira/browse/SOLR-1926
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Trivial
 Fix For: 3.5, 4.0

 Attachments: SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch


 If the hl.q parameter is set, HighlightComponent uses it rather than q.
 Use case:
 You search for PC with highlighting and faceting enabled:
 {code}
 q=PC
 facet=on&facet.field=maker&facet.field=something
 hl=on&hl.fl=desc
 {code}
 You get a lot of results with snippets (the term PC highlighted in the desc field). 
 Then you click the link maker:DELL(50) to narrow the results:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc
 {code}
 You get narrowed results with the term PC highlighted in the snippets. But 
 sometimes I'd like DELL to be highlighted as well, because I clicked 
 DELL. In this case, hl.q can be used:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc&hl.q=PC+maker:DELL
 {code}
 Note that hl.requireFieldMatch should be false (the default) in this 
 scenario.


[jira] [Updated] (SOLR-1926) add hl.q parameter

2011-11-05 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1926:
-

Attachment: SOLR-1926.patch

The first draft patch. It implements only hl.q. I'm still working on hl.text, 
which was suggested by Hoss.


 add hl.q parameter
 --

 Key: SOLR-1926
 URL: https://issues.apache.org/jira/browse/SOLR-1926
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Priority: Trivial
 Attachments: SOLR-1926.patch


 If the hl.q parameter is set, HighlightComponent uses it rather than q.
 Use case:
 You search for PC with highlighting and faceting enabled:
 {code}
 q=PC
 facet=on&facet.field=maker&facet.field=something
 hl=on&hl.fl=desc
 {code}
 You get a lot of results with snippets (the term PC highlighted in the desc field). 
 Then you click the link maker:DELL(50) to narrow the results:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc
 {code}
 You get narrowed results with the term PC highlighted in the snippets. But 
 sometimes I'd like DELL to be highlighted as well, because I clicked 
 DELL. In this case, hl.q can be used:
 {code}
 q=PC
 facet=on&facet.field=something
 fq=maker:DELL
 hl=on&hl.fl=desc&hl.q=PC+maker:DELL
 {code}
 Note that hl.requireFieldMatch should be false (the default) in this 
 scenario.


[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-24 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3440:
---

Attachment: LUCENE-3440.patch

New patch. It still has test failures, though.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html


 The FastVectorHighlighter gives every term found in a fragment an equal 
 weight. As a result, fragments with a high number of words or, in the worst 
 case, a high number of very common words are ranked higher than fragments 
 that contain *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + IDF for unique term per fragment * boost of 
 query; 
 The ranking formula should be the same as, or at least similar to, the one 
 used in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be to move the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed
 - The edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 
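
 The weighting formula in the description can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the attached patch: the class and method names are hypothetical, the IDF values are passed in as a plain map instead of being read from an index, and a single query boost is applied to every term.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of fragment scoring; not Lucene code. */
final class FragmentWeightSketch {
    private FragmentWeightSketch() {}

    /** Equal weighting: every matched term occurrence counts the same. */
    static double equalWeight(List<String> matchedTerms, double boost) {
        return matchedTerms.size() * boost;
    }

    /**
     * IDF weighting: total weight += IDF(term) * boost, once per unique
     * term in the fragment. Terms missing from the map contribute 0.
     */
    static double idfWeight(List<String> matchedTerms, Map<String, Double> idf, double boost) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String term : matchedTerms) {
            if (seen.add(term)) { // true only for the first occurrence
                total += idf.getOrDefault(term, 0.0) * boost;
            }
        }
        return total;
    }
}
```

 With made-up IDF values, a fragment that repeats one very common term three times scores higher than a fragment with two rare terms under equal weighting, but lower under IDF weighting, which is the reordering the description asks for.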


[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3440:
---

Attachment: (was: LUCENE-3440.patch)

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html


 The FastVectorHighlighter gives every term found in a fragment an equal 
 weight. As a result, fragments with a high number of words or, in the worst 
 case, a high number of very common words are ranked higher than fragments 
 that contain *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + IDF for unique term per fragment * boost of 
 query; 
 The ranking formula should be the same as, or at least similar to, the one 
 used in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be to move the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed
 - The edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 


[jira] [Updated] (SOLR-2837) listing supported languages

2011-10-15 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2837:
-

Attachment: SOLR-2837.patch

Draft patch; this is not the ideal version. It's not elegant, but it works for me today.

{code}
$ cd contrib/langid
$ ant -emacs list-supported-lang
list-supported-lang:
da
is
it
no
hu
th
de
el
fi
pt
pl
sv
fr
en
ru
et
es
nl

BUILD SUCCESSFUL
Total time: 0 seconds
{code}


 listing supported languages
 ---

 Key: SOLR-2837
 URL: https://issues.apache.org/jira/browse/SOLR-2837
 Project: Solr
  Issue Type: Improvement
  Components: contrib - LangId
Affects Versions: 3.5, 4.0
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2837.patch


 As a user of langid, I'd like to know which languages are supported by 
 current langid, ideally via admin gui.


[jira] [Updated] (SOLR-2838) use preferable default for langid.idField

2011-10-15 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2838:
-

Attachment: SOLR-2838.patch

 use preferable default for langid.idField
 -

 Key: SOLR-2838
 URL: https://issues.apache.org/jira/browse/SOLR-2838
 Project: Solr
  Issue Type: Improvement
  Components: contrib - LangId
Affects Versions: 3.5, 4.0
Reporter: Koji Sekiguchi
Priority: Trivial
 Attachments: SOLR-2838.patch


 langid.idField is used for logging purposes in langid. If it is not set, id 
 is used as the default. But if the schema has no id field, users who are 
 unaware of this easily overlooked parameter get undesirable warnings in 
 the log:
 {noformat}
 WARNING: Document *null* does not contain input field subject. Skipping this 
 field.
 {noformat}
 Since we can access IndexSchema in initParams(), why don't we use the 
 uniqueKey field as the default?
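
 The proposed default can be sketched as follows. This is a minimal illustration, not the attached patch: the class and method are hypothetical, the uniqueKey field name is passed in as a plain string instead of being read from IndexSchema, and falling back to "id" when the schema defines no uniqueKey is an assumption made here.

```java
/** Hypothetical helper; not part of the langid contrib. */
final class LangIdFieldDefaultSketch {
    private LangIdFieldDefaultSketch() {}

    /**
     * Picks the field used to identify documents in log messages:
     * the explicit langid.idField if set, else the schema's uniqueKey,
     * else "id" (assumed last resort).
     */
    static String resolveIdField(String configuredIdField, String uniqueKeyFieldName) {
        if (configuredIdField != null && !configuredIdField.isEmpty()) {
            return configuredIdField;
        }
        if (uniqueKeyFieldName != null && !uniqueKeyFieldName.isEmpty()) {
            return uniqueKeyFieldName;
        }
        return "id";
    }
}
```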


[jira] [Updated] (LUCENE-3513) Add SimpleFragListBuilder constructor with margin parameter

2011-10-15 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3513:
---

Affects Version/s: (was: 3.4)
   2.9
Fix Version/s: 4.0
   3.5
 Assignee: Koji Sekiguchi

 Add SimpleFragListBuilder constructor with margin parameter
 ---

 Key: LUCENE-3513
 URL: https://issues.apache.org/jira/browse/LUCENE-3513
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 2.9
Reporter: Kelsey Francis
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3513.patch, LUCENE-3513.patch


 {{SimpleFragListBuilder}} would benefit from an additional constructor that 
 takes in {{margin}}. Currently, the margin is defined as a constant, so to 
 implement a {{FragListBuilder}} with a different margin, one has no choice 
 but to copy and paste {{SimpleFragListBuilder}} into a new class that must be 
 placed in the {{org.apache.lucene.search.vectorhighlight}} package due to 
 accesses of package-protected fields in other classes.
 If this change were made, the precondition check of the constructor's 
 {{fragCharSize}} should probably be altered to ensure that it's no less than 
 {{max(1, margin*3)}} to allow for a margin of 0.
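
 A minimal sketch of the proposed constructor and check follows. It is not Lucene's SimpleFragListBuilder: the class name and its members are hypothetical, the precondition is read as "fragCharSize must be at least max(1, margin*3)" (an assumption about the intended meaning), and rejecting a negative margin is an extra assumption.

```java
/** Hypothetical stand-in for a frag list builder with a configurable margin. */
final class MarginFragListBuilderSketch {
    final int margin;
    final int minFragCharSize;

    MarginFragListBuilderSketch(int margin) {
        if (margin < 0) {
            throw new IllegalArgumentException("margin(" + margin + ") must not be negative");
        }
        this.margin = margin;
        // max(1, ...) keeps the minimum fragment size positive when margin is 0.
        this.minFragCharSize = Math.max(1, margin * 3);
    }

    /** Throws if the requested fragment size is below the minimum for this margin. */
    void checkFragCharSize(int fragCharSize) {
        if (fragCharSize < minFragCharSize) {
            throw new IllegalArgumentException("fragCharSize(" + fragCharSize
                    + ") is too small. It must be " + minFragCharSize + " or higher.");
        }
    }
}
```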


[jira] [Updated] (LUCENE-3513) Add SimpleFragListBuilder constructor with margin parameter

2011-10-15 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3513:
---

Attachment: LUCENE-3513.patch

Updated patch attached. I needed to update the test. I've also changed the 
visibility of the new member variable to package-private for the test.

I'll commit soon.

 Add SimpleFragListBuilder constructor with margin parameter
 ---

 Key: LUCENE-3513
 URL: https://issues.apache.org/jira/browse/LUCENE-3513
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 2.9
Reporter: Kelsey Francis
Priority: Minor
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3513.patch, LUCENE-3513.patch


 {{SimpleFragListBuilder}} would benefit from an additional constructor that 
 takes in {{margin}}. Currently, the margin is defined as a constant, so to 
 implement a {{FragListBuilder}} with a different margin, one has no choice 
 but to copy and paste {{SimpleFragListBuilder}} into a new class that must be 
 placed in the {{org.apache.lucene.search.vectorhighlight}} package due to 
 accesses of package-protected fields in other classes.
 If this change were made, the precondition check of the constructor's 
 {{fragCharSize}} should probably be altered to ensure that it's no less than 
 {{max(1, margin*3)}} to allow for a margin of 0.


[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-14 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-3440:
---

Attachment: LUCENE-3440.patch

In this patch, I removed the FieldFragList interface, renamed BaseFieldFragList 
to FieldFragList, and moved the javadocs from the interface to the abstract class.

I'm still working.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html


 The FastVectorHighlighter gives every term found in a fragment an equal 
 weight. As a result, fragments with a high number of words or, in the worst 
 case, a high number of very common words are ranked higher than fragments 
 that contain *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + IDF for unique term per fragment * boost of 
 query; 
 The ranking formula should be the same as, or at least similar to, the one 
 used in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be to move the whole fragment scoring into a 
 separate class.
 - Switch scoring via a parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed
 - The edismax/dismax parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 


[jira] [Updated] (SOLR-1895) ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time

2011-09-28 Thread Koji Sekiguchi (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1895:
-

Attachment: SOLR-1895-queries.patch

I fixed the patch for distributed search. I also modified the dot.classpath 
file in this patch.

 ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search 
 time
 --

 Key: SOLR-1895
 URL: https://issues.apache.org/jira/browse/SOLR-1895
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Reporter: Karl Wright
  Labels: document, security, solr
 Fix For: 3.5, 4.0

 Attachments: LCFSecurityFilter.java, LCFSecurityFilter.java, 
 LCFSecurityFilter.java, LCFSecurityFilter.java, SOLR-1895-queries.patch, 
 SOLR-1895-queries.patch, SOLR-1895-queries.patch, SOLR-1895-queries.patch, 
 SOLR-1895-queries.patch, SOLR-1895-service-plugin.patch, 
 SOLR-1895-service-plugin.patch, SOLR-1895.patch, SOLR-1895.patch, 
 SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch


 I've written an LCF SearchComponent which filters returned results based on 
 access tokens provided by LCF's authority service.  The component requires 
 you to configure the appropriate authority service URL base, e.g.:
   <!-- LCF document security enforcement component -->
   <searchComponent name="lcfSecurity" class="LCFSecurityFilter">
     <str name="AuthorityServiceBaseURL">http://localhost:8080/lcf-authority-service</str>
   </searchComponent>
 Also required are the following schema.xml additions:
   <!-- Security fields -->
   <field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
   <field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
   <field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
   <field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
 Finally, to tie it into the standard request handler, it seems to need to run 
 last:
   <requestHandler name="standard" class="solr.SearchHandler" default="true">
     <arr name="last-components">
       <str>lcfSecurity</str>
     </arr>
     ...
 I have not set a package for this code.  Nor have I been able to get it 
 reviewed by someone as conversant with Solr as I would prefer.  It is my 
 hope, however, that this module will become part of the standard Solr 1.5 
 suite of search components, since that would tie it in with LCF nicely.


