[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-25 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664112#comment-16664112
 ] 

Tim Allison commented on SOLR-12423:


Thank you [~ctargett]!

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: SOLR-12423.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!
> This issue covers the basic upgrading to 1.19.1.  For the migration to the 
> ForkParser, see SOLR-11721.
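The scheme the description sketches — a child process loading Parser classes from a directory of jars — rests on ordinary jar-directory classloading. The snippet below is not Tika's ForkParser API; it is a stdlib-only sketch (the `forJarDirectory` helper is hypothetical) of what a child process can do with a `tika-bin/`-style directory of jars:

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class JarDirClassLoaderSketch {
    // Build a classloader over every jar in a directory -- roughly what a
    // fork-parser child process would do with a directory of parser jars.
    static ClassLoader forJarDirectory(File dir) throws Exception {
        List<URL> urls = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.getName().endsWith(".jar")) {
                    urls.add(f.toURI().toURL());
                }
            }
        }
        // Parent delegation: anything not found in the jars falls back to
        // the parent loader (here, the application classloader).
        return new URLClassLoader(urls.toArray(new URL[0]),
                JarDirClassLoaderSketch.class.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        // Even with no jars present, JDK classes resolve via the parent.
        File dir = new File(System.getProperty("java.io.tmpdir"));
        ClassLoader cl = forJarDirectory(dir);
        System.out.println(cl.loadClass("java.lang.String") == String.class);
    }
}
```

Tika's ForkParser layers the forked JVM on top of this, which is what contains the OOMs, infinite loops, and crashes the description mentions.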



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653918#comment-16653918
 ] 

Tim Allison commented on SOLR-12423:


W00t!  Thank you, [~erickerickson]!

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: SOLR-12423.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!
> This issue covers the basic upgrading to 1.19.1.  For the migration to the 
> ForkParser, see SOLR-11721.






[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653612#comment-16653612
 ] 

Tim Allison commented on SOLR-12423:


Y, I know...bummed I couldn't attend this year.  No rush on my part.  Thank you!

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!
> This issue covers the basic upgrading to 1.19.1.  For the migration to the 
> ForkParser, see SOLR-11721.






[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653585#comment-16653585
 ] 

Tim Allison commented on SOLR-12423:


Would a Solr committer be willing to help with this?  

Tika 1.19.1 fixes ~8 oom/infinite loop vulnerabilities: 
https://tika.apache.org/security.html 

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!
> This issue covers the basic upgrading to 1.19.1.  For the migration to the 
> ForkParser, see SOLR-11721.






[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-11 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646796#comment-16646796
 ] 

Tim Allison commented on SOLR-12423:


I tested PR#468 against the ~650 unit test docs within Tika's project, and 
found no surprises.

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!
> This issue covers the basic upgrading to 1.19.1.  For the migration to the 
> ForkParser, see SOLR-11721.






[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-11 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12423:
---
Description: 
In Tika 1.19, there will be the ability to call the ForkParser and specify a 
directory of jars from which to load the classes for the Parser in the child 
processes. This will allow us to remove all of the parser dependencies from 
Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar in 
the child process’ bin directory and be done with the upgrade... no more fiddly 
dependency upgrades and threat of jar hell.

The ForkParser also protects against ooms, infinite loops and jvm crashes. W00t!

This issue covers the basic upgrading to 1.19.1.  For the migration to the 
ForkParser, see SOLR-11721.

  was:
In Tika 1.19, there will be the ability to call the ForkParser and specify a 
directory of jars from which to load the classes for the Parser in the child 
processes. This will allow us to remove all of the parser dependencies from 
Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar in 
the child process’ bin directory and be done with the upgrade... no more fiddly 
dependency upgrades and threat of jar hell.

 

The ForkParser also protects against ooms, infinite loops and jvm crashes. W00t!


> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!
> This issue covers the basic upgrading to 1.19.1.  For the migration to the 
> ForkParser, see SOLR-11721.






[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19.1 when available

2018-10-11 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12423:
---
Summary: Upgrade to Tika 1.19.1 when available  (was: Upgrade to Tika 
1.19.1 when available and refactor to use the ForkParser)

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
>  
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!






[jira] [Resolved] (SOLR-12034) Replace TokenizerChain in Solr with Lucene's CustomAnalyzer

2018-10-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved SOLR-12034.

Resolution: Won't Fix

I can't see a way to implement this without wrecking the API for 
CustomAnalyzer's Builder().  Please re-open if there's a clean way to do this.

> Replace TokenizerChain in Solr with Lucene's CustomAnalyzer
> -----------------------------------------------------------
>
> Key: SOLR-12034
> URL: https://issues.apache.org/jira/browse/SOLR-12034
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: David Smiley
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Solr's TokenizerChain was created before Lucene's CustomAnalyzer was added, 
> and it duplicates much of CustomAnalyzer.  Let's consider refactoring to 
> remove TokenizerChain.






[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser

2018-10-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633961#comment-16633961
 ] 

Tim Allison commented on SOLR-12423:


Tika 1.19 fixed a number of vulnerabilities 
(https://tika.apache.org/security.html), but it has some issues.  We should 
wait for 1.19.1. We'll be rolling rc2 as soon as PDFBox 2.0.12 is available, 
and the voting for PDFBox 2.0.12 should start today.

 

 

> Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser
> ------------------------------------------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
>  
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!






[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser

2018-10-01 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12423:
---
Summary: Upgrade to Tika 1.19.1 when available and refactor to use the 
ForkParser  (was: Upgrade to Tika 1.19 when available and refactor to use the 
ForkParser)

> Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser
> ------------------------------------------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
>  
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!






[jira] [Commented] (SOLR-12551) Upgrade to Tika 1.18

2018-07-13 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543764#comment-16543764
 ] 

Tim Allison commented on SOLR-12551:


Yes. Why, yes I do. Thank you!


> Upgrade to Tika 1.18
> --------------------
>
> Key: SOLR-12551
> URL: https://issues.apache.org/jira/browse/SOLR-12551
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Until 1.19 is ready (SOLR-12423), let's upgrade to 1.18.






[jira] [Commented] (SOLR-12551) Upgrade to Tika 1.18

2018-07-13 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543732#comment-16543732
 ] 

Tim Allison commented on SOLR-12551:


I did the full integration tests with this against all of Tika's test files 
(with ucar files removed).

> Upgrade to Tika 1.18
> --------------------
>
> Key: SOLR-12551
> URL: https://issues.apache.org/jira/browse/SOLR-12551
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Until 1.19 is ready (SOLR-12423), let's upgrade to 1.18.






[jira] [Created] (SOLR-12551) Upgrade to Tika 1.18

2018-07-13 Thread Tim Allison (JIRA)
Tim Allison created SOLR-12551:
--

 Summary: Upgrade to Tika 1.18
 Key: SOLR-12551
 URL: https://issues.apache.org/jira/browse/SOLR-12551
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison


Until 1.19 is ready (SOLR-12423), let's upgrade to 1.18.






[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19 when available and refactor to use the ForkParser

2018-05-30 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12423:
---
Environment: (was: in Tika 1.19)

> Upgrade to Tika 1.19 when available and refactor to use the ForkParser
> ----------------------------------------------------------------------
>
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a 
> directory of jars from which to load the classes for the Parser in the child 
> processes. This will allow us to remove all of the parser dependencies from 
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar 
> in the child process’ bin directory and be done with the upgrade... no more 
> fiddly dependency upgrades and threat of jar hell.
>  
> The ForkParser also protects against ooms, infinite loops and jvm crashes. 
> W00t!






[jira] [Updated] (SOLR-12422) Update Ref Guide to recommend against using the ExtractingRequestHandler in production

2018-05-30 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12422:
---
Description: 
[~elyograg] recently updated the wiki to include the hard-learned guidance that 
the ExtractingRequestHandler should not be used in production. [~ctargett] 
recommended updating the reference guide instead. Let’s update the ref guide.

 

...note to self...don't open issue on tiny screen...sorry for the clutter...

> Update Ref Guide to recommend against using the ExtractingRequestHandler in 
> production
> --------------------------------------------------------------------------
>
> Key: SOLR-12422
> URL: https://issues.apache.org/jira/browse/SOLR-12422
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>
> [~elyograg] recently updated the wiki to include the hard-learned guidance 
> that the ExtractingRequestHandler should not be used in production. 
> [~ctargett] recommended updating the reference guide instead. Let’s update 
> the ref guide.
>  
> ...note to self...don't open issue on tiny screen...sorry for the clutter...






[jira] [Updated] (SOLR-12422) Update Ref Guide to recommend against using the ExtractingRequestHandler in production

2018-05-30 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12422:
---
Environment: (was: Shawn Heisey recently updated the wiki to include 
the hard-learned guidance that the ExtractingRequestHandler should not be used 
in production. Cassandra Targett recommended updating the reference guide 
instead. Let’s update the ref guide.)

> Update Ref Guide to recommend against using the ExtractingRequestHandler in 
> production
> --------------------------------------------------------------------------
>
> Key: SOLR-12422
> URL: https://issues.apache.org/jira/browse/SOLR-12422
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Priority: Major
>







[jira] [Created] (SOLR-12423) Upgrade to Tika 1.19 when available and refactor to use the ForkParser

2018-05-29 Thread Tim Allison (JIRA)
Tim Allison created SOLR-12423:
--

 Summary: Upgrade to Tika 1.19 when available and refactor to use 
the ForkParser
 Key: SOLR-12423
 URL: https://issues.apache.org/jira/browse/SOLR-12423
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
 Environment: in Tika 1.19
Reporter: Tim Allison


In Tika 1.19, there will be the ability to call the ForkParser and specify a 
directory of jars from which to load the classes for the Parser in the child 
processes. This will allow us to remove all of the parser dependencies from 
Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar in 
the child process’ bin directory and be done with the upgrade... no more fiddly 
dependency upgrades and threat of jar hell.

 

The ForkParser also protects against ooms, infinite loops and jvm crashes. W00t!






[jira] [Created] (SOLR-12422) Update Ref Guide to recommend against using the ExtractingRequestHandler in production

2018-05-29 Thread Tim Allison (JIRA)
Tim Allison created SOLR-12422:
--

 Summary: Update Ref Guide to recommend against using the 
ExtractingRequestHandler in production
 Key: SOLR-12422
 URL: https://issues.apache.org/jira/browse/SOLR-12422
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
 Environment: Shawn Heisey recently updated the wiki to include the 
hard-learned guidance that the ExtractingRequestHandler should not be used in 
production. Cassandra Targett recommended updating the reference guide instead. 
Let’s update the ref guide.
Reporter: Tim Allison









[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-03-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391705#comment-16391705
 ] 

Tim Allison commented on SOLR-11976:


Done. 

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> -----------------------------------------------------------------------
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Assignee: David Smiley
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}
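The one-line nature of the fix is easy to see in a toy model. The stand-alone sketch below (plain `UnaryOperator<String>`s standing in for Lucene token filters; not the actual TokenizerChain code) shows how applying each filter to the original input discards all but the last filter's work, while feeding each filter the previous result chains them:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class ChainVsOverwrite {
    // Stand-ins for two token filters: lowercase, then strip non-letters.
    static final List<UnaryOperator<String>> FILTERS = List.of(
            String::toLowerCase,
            s -> s.replaceAll("[^a-z]", ""));

    // Buggy version: each filter is applied to the original input,
    // so only the last filter's output survives (mirrors filter.create(in)).
    static String buggy(String in) {
        String result = in;
        for (UnaryOperator<String> f : FILTERS) {
            result = f.apply(in);
        }
        return result;
    }

    // Fixed version: each filter consumes the previous filter's output
    // (mirrors filter.create(result)).
    static String fixed(String in) {
        String result = in;
        for (UnaryOperator<String> f : FILTERS) {
            result = f.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(buggy("Hello-World"));  // lowercasing step is lost
        System.out.println(fixed("Hello-World"));  // both filters applied
    }
}
```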






[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390212#comment-16390212
 ] 

Tim Allison commented on SOLR-11976:


This issue will be moot after SOLR-12034 is in place.  The other issues (linked 
from SOLR-12034) are relevant but not blockers on this nor blocked by this.

So, until SOLR-12034 is in place, this is valid and should be ready for 7.3 
(although the PR is against master, of course).

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> -----------------------------------------------------------------------
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Assignee: David Smiley
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Created] (LUCENE-8193) Deprecate LowercaseTokenizer

2018-03-05 Thread Tim Allison (JIRA)
Tim Allison created LUCENE-8193:
---

 Summary: Deprecate LowercaseTokenizer
 Key: LUCENE-8193
 URL: https://issues.apache.org/jira/browse/LUCENE-8193
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Reporter: Tim Allison


On LUCENE-8186, discussion favored deprecating and eventually removing 
LowercaseTokenizer.






[jira] [Comment Edited] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-03-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386042#comment-16386042
 ] 

Tim Allison edited comment on LUCENE-8186 at 3/5/18 1:19 PM:
-------------------------------------------------------------

[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did when 
refactoring my code for SOLR-5410. :)


was (Author: talli...@mitre.org):
[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my 
code for SOLR-5410. :)

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Attachments: LUCENE-8186.patch
>
>
> While working on SOLR-12034, a unit test that relied on the 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
> Analyzer analyzer =  
> CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
> //fails
> assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
> 
> //now try an integration test with the classic query parser
> QueryParser p = new QueryParser("f", analyzer);
> Query q = p.parse("Hello");
> //passes
> assertEquals(new TermQuery(new Term("f", "hello")), q);
> q = p.parse("Hello*");
> //fails
> assertEquals(new PrefixQuery(new Term("f", "hello")), q);
> q = p.parse("Hel*o");
> //fails
> assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
> does not call the tokenizer, which, in the case of the LowerCaseTokenizer, 
> does the filtering work.
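A stripped-down model of that failure mode (hypothetical methods, not the CustomAnalyzer API): when the tokenizer itself performs the case folding and normalize() walks only the filter chain, normalization silently returns the input unchanged:

```java
public class NormalizeSkipsTokenizer {
    // LowerCaseTokenizer-like: the tokenizer itself does the lowercasing.
    static String tokenize(String s) {
        return s.toLowerCase();
    }

    // Full analysis runs the tokenizer, so case folding happens.
    static String analyze(String s) {
        return tokenize(s);
    }

    // normalize() iterates only the token filters (none configured here)
    // and never invokes the tokenizer, so the lowercasing is skipped.
    static String normalize(String s) {
        return s;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello"));    // tokenizer lowercases
        System.out.println(normalize("Hello"));  // the reported mismatch
    }
}
```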






[jira] [Comment Edited] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-03-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386042#comment-16386042
 ] 

Tim Allison edited comment on LUCENE-8186 at 3/5/18 1:18 PM:
-

[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my 
code for SOLR-5410. :)


was (Author: talli...@mitre.org):
[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my 
custom code. :)

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> --
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Attachments: LUCENE-8186.patch
>
>
> While working on SOLR-12034, a unit test that relied on the 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
> Analyzer analyzer =  
> CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
> //fails
> assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
> 
> //now try an integration test with the classic query parser
> QueryParser p = new QueryParser("f", analyzer);
> Query q = p.parse("Hello");
> //passes
> assertEquals(new TermQuery(new Term("f", "hello")), q);
> q = p.parse("Hello*");
> //fails
> assertEquals(new PrefixQuery(new Term("f", "hello")), q);
> q = p.parse("Hel*o");
> //fails
> assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
> does not call the tokenizer, which, in the case of the LowerCaseTokenizer, 
> does the filtering work.






[jira] [Comment Edited] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-03-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386042#comment-16386042
 ] 

Tim Allison edited comment on LUCENE-8186 at 3/5/18 1:05 PM:
-

[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my 
custom code. :)


was (Author: talli...@mitre.org):
[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my 
custom code.

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> --
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Attachments: LUCENE-8186.patch
>
>
> While working on SOLR-12034, a unit test that relied on the 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
> Analyzer analyzer =  
> CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
> //fails
> assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
> 
> //now try an integration test with the classic query parser
> QueryParser p = new QueryParser("f", analyzer);
> Query q = p.parse("Hello");
> //passes
> assertEquals(new TermQuery(new Term("f", "hello")), q);
> q = p.parse("Hello*");
> //fails
> assertEquals(new PrefixQuery(new Term("f", "hello")), q);
> q = p.parse("Hel*o");
> //fails
> assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
> does not call the tokenizer, which, in the case of the LowerCaseTokenizer, 
> does the filtering work.






[jira] [Commented] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-03-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386042#comment-16386042
 ] 

Tim Allison commented on LUCENE-8186:
-

[~thetaphi], it works because multiterms are normalized in {{TextField}}'s 
{{analyzeMultiTerm}}: 
https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168
 , which uses the full analyzer including the tokenizer.

AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the 
moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my 
custom code.

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> --
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Attachments: LUCENE-8186.patch
>
>
> While working on SOLR-12034, a unit test that relied on the 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
> Analyzer analyzer =  
> CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
> //fails
> assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
> 
> //now try an integration test with the classic query parser
> QueryParser p = new QueryParser("f", analyzer);
> Query q = p.parse("Hello");
> //passes
> assertEquals(new TermQuery(new Term("f", "hello")), q);
> q = p.parse("Hello*");
> //fails
> assertEquals(new PrefixQuery(new Term("f", "hello")), q);
> q = p.parse("Hel*o");
> //fails
> assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
> does not call the tokenizer, which, in the case of the LowerCaseTokenizer, 
> does the filtering work.






[jira] [Commented] (SOLR-12048) Cannot index formatted mail

2018-03-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382640#comment-16382640
 ] 

Tim Allison commented on SOLR-12048:


Sorry...didn't realize MailEntityProcessor is not using Tika for the main body 
processing...looking through MEP now...

> Cannot index formatted mail
> ---
>
> Key: SOLR-12048
> URL: https://issues.apache.org/jira/browse/SOLR-12048
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.1
>Reporter: Dimitris
>Priority: Major
> Attachments: index_no_content.txt, index_success.txt
>
>
> Using /example/example-DIH/solr/mail/ configuration, a gmail mailbox has been 
> indexed. Nevertheless, only plain text mails are indexed. Formatted content 
> is not indexed.






[jira] [Comment Edited] (SOLR-12048) Cannot index formatted mail

2018-03-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382624#comment-16382624
 ] 

Tim Allison edited comment on SOLR-12048 at 3/1/18 8:55 PM:


Or probably u...@tika.apache.org :)

+1 to closing this issue and moving the discussion to the Solr user list.

In Tika <=1.17, these alternate bodies were treated as attachments, and we've 
fixed this for 1.18.

Make sure to change {{processAttachement}} to true if you haven't!

from {{mail-data-config.xml}}
{noformat}
 
{noformat}


was (Author: talli...@mitre.org):
Or probably u...@tika.apache.org :)

In Tika <=1.17, these alternate bodies were treated as attachments, and we've 
fixed this for 1.18.

Make sure to change {{processAttachement}} to true if you haven't!

from {{mail-data-config.xml}}
{noformat}
 
{noformat}

> Cannot index formatted mail
> ---
>
> Key: SOLR-12048
> URL: https://issues.apache.org/jira/browse/SOLR-12048
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.1
>Reporter: Dimitris
>Priority: Major
> Attachments: index_no_content.txt, index_success.txt
>
>
> Using /example/example-DIH/solr/mail/ configuration, a gmail mailbox has been 
> indexed. Nevertheless, only plain text mails are indexed. Formatted content 
> is not indexed.






[jira] [Commented] (SOLR-12048) Cannot index formatted mail

2018-03-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382624#comment-16382624
 ] 

Tim Allison commented on SOLR-12048:


Or probably u...@tika.apache.org :)

In Tika <=1.17, these alternate bodies were treated as attachments, and we've 
fixed this for 1.18.

Make sure to change {{processAttachement}} to true if you haven't!

from {{mail-data-config.xml}}
{noformat}
 
{noformat}

> Cannot index formatted mail
> ---
>
> Key: SOLR-12048
> URL: https://issues.apache.org/jira/browse/SOLR-12048
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.1
>Reporter: Dimitris
>Priority: Major
> Attachments: index_no_content.txt, index_success.txt
>
>
> Using /example/example-DIH/solr/mail/ configuration, a gmail mailbox has been 
> indexed. Nevertheless, only plain text mails are indexed. Formatted content 
> is not indexed.






[jira] [Updated] (SOLR-12035) ExtendedDismaxQParser fails to include charfilters in nostopanalyzer

2018-02-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12035:
---
Affects Version/s: master (8.0)

> ExtendedDismaxQParser fails to include charfilters in nostopanalyzer
> 
>
> Key: SOLR-12035
> URL: https://issues.apache.org/jira/browse/SOLR-12035
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some circumstances, the ExtendedDismaxQParser tries to remove stop filters 
> from the TokenizerChain.  When building the new analyzer without the stop 
> filters, the charfilters from the original TokenizerChain are not copied over.
> The fix is trivial.
> {noformat}
> -  TokenizerChain newa = new TokenizerChain(tcq.getTokenizerFactory(), 
> newtf);
> + TokenizerChain newa = new TokenizerChain(tcq.getCharFilterFactories(), 
> tcq.getTokenizerFactory(), newtf);
> {noformat}






[jira] [Updated] (SOLR-12035) ExtendedDismaxQParser fails to include charfilters in nostopanalyzer

2018-02-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-12035:
---
Component/s: query parsers

> ExtendedDismaxQParser fails to include charfilters in nostopanalyzer
> 
>
> Key: SOLR-12035
> URL: https://issues.apache.org/jira/browse/SOLR-12035
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some circumstances, the ExtendedDismaxQParser tries to remove stop filters 
> from the TokenizerChain.  When building the new analyzer without the stop 
> filters, the charfilters from the original TokenizerChain are not copied over.
> The fix is trivial.
> {noformat}
> -  TokenizerChain newa = new TokenizerChain(tcq.getTokenizerFactory(), 
> newtf);
> + TokenizerChain newa = new TokenizerChain(tcq.getCharFilterFactories(), 
> tcq.getTokenizerFactory(), newtf);
> {noformat}






[jira] [Created] (SOLR-12035) ExtendedDismaxQParser fails to include charfilters in nostopanalyzer

2018-02-26 Thread Tim Allison (JIRA)
Tim Allison created SOLR-12035:
--

 Summary: ExtendedDismaxQParser fails to include charfilters in 
nostopanalyzer
 Key: SOLR-12035
 URL: https://issues.apache.org/jira/browse/SOLR-12035
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison


In some circumstances, the ExtendedDismaxQParser tries to remove stop filters 
from the TokenizerChain.  When building the new analyzer without the stop 
filters, the charfilters from the original TokenizerChain are not copied over.

The fix is trivial.
{noformat}
-  TokenizerChain newa = new TokenizerChain(tcq.getTokenizerFactory(), 
newtf);
+ TokenizerChain newa = new TokenizerChain(tcq.getCharFilterFactories(), 
tcq.getTokenizerFactory(), newtf);
{noformat}
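The one-line fix above can be modelled without Solr: when rebuilding an analyzer chain minus its stop filters, the char filters must be copied over or they are silently lost. This is a hypothetical sketch — `AnalyzerChain` and the `"stop"` marker are stand-ins, not the actual Solr `TokenizerChain` API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for Solr's TokenizerChain: char filters + tokenizer + token filters.
public class AnalyzerChain {
    final List<String> charFilters;
    final String tokenizer;
    final List<String> tokenFilters;

    AnalyzerChain(List<String> charFilters, String tokenizer, List<String> tokenFilters) {
        this.charFilters = charFilters;
        this.tokenizer = tokenizer;
        this.tokenFilters = tokenFilters;
    }

    // Buggy rebuild, mirroring the pre-fix edismax code: char filters are dropped.
    AnalyzerChain withoutStopFiltersBuggy() {
        return new AnalyzerChain(List.of(), tokenizer, dropStops());
    }

    // Fixed rebuild: char filters are carried over, as in the patch.
    AnalyzerChain withoutStopFiltersFixed() {
        return new AnalyzerChain(charFilters, tokenizer, dropStops());
    }

    private List<String> dropStops() {
        return tokenFilters.stream()
                .filter(f -> !f.equals("stop"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AnalyzerChain a = new AnalyzerChain(
                List.of("htmlstrip"), "standard", List.of("stop", "lowercase"));
        System.out.println(a.withoutStopFiltersBuggy().charFilters); // []  -- lost
        System.out.println(a.withoutStopFiltersFixed().charFilters); // [htmlstrip]
    }
}
```

The fixed constructor call corresponds directly to the patch: passing `tcq.getCharFilterFactories()` through to the new `TokenizerChain` instead of omitting it.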






[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-02-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated LUCENE-8186:

Description: 
While working on SOLR-12034, a unit test that relied on the 
LowerCaseTokenizerFactory failed.

After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
  @Test
  public void testLCTokenizerFactoryNormalize() throws Exception {

Analyzer analyzer =  
CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();

//fails
assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

//now try an integration test with the classic query parser
QueryParser p = new QueryParser("f", analyzer);
Query q = p.parse("Hello");
//passes
assertEquals(new TermQuery(new Term("f", "hello")), q);

q = p.parse("Hello*");
//fails
assertEquals(new PrefixQuery(new Term("f", "hello")), q);

q = p.parse("Hel*o");
//fails
assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
  }
{noformat}

The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does 
the filtering work.

  was:
While working on SOLR-12034, a unit test that relied on the 
LowerCaseTokenizerFactory failed.

After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
  @Test
  public void testLCTokenizerFactoryNormalize() throws Exception {

Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new 
LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();

//fails
assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

//now try an integration test with the classic query parser
QueryParser p = new QueryParser("f", analyzer);
Query q = p.parse("Hello");
//passes
assertEquals(new TermQuery(new Term("f", "hello")), q);

q = p.parse("Hello*");
//fails
assertEquals(new PrefixQuery(new Term("f", "hello")), q);

q = p.parse("Hel*o");
//fails
assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
  }
{noformat}

The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does 
the filtering work.


> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> --
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
>
> While working on SOLR-12034, a unit test that relied on the 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
> Analyzer analyzer =  
> CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
> //fails
> assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
> 
> //now try an integration test with the classic query parser
> QueryParser p = new QueryParser("f", analyzer);
> Query q = p.parse("Hello");
> //passes
> assertEquals(new TermQuery(new Term("f", "hello")), q);
> q = p.parse("Hello*");
> //fails
> assertEquals(new PrefixQuery(new Term("f", "hello")), q);
> q = p.parse("Hel*o");
> //fails
> assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
> does not call the tokenizer, which, in the case of the LowerCaseTokenizer, 
> does the filtering work.






[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-02-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated LUCENE-8186:

Description: 
While working on SOLR-12034, a unit test that relied on the 
LowerCaseTokenizerFactory failed.

After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
  @Test
  public void testLCTokenizerFactoryNormalize() throws Exception {

Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new 
LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();

//fails
assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

//now try an integration test with the classic query parser
QueryParser p = new QueryParser("f", analyzer);
Query q = p.parse("Hello");
//passes
assertEquals(new TermQuery(new Term("f", "hello")), q);

q = p.parse("Hello*");
//fails
assertEquals(new PrefixQuery(new Term("f", "hello")), q);

q = p.parse("Hel*o");
//fails
assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
  }
{noformat}

The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does 
the filtering work.

  was:
While working on SOLR-12034, a unit test that relied on the 
LowerCaseTokenizerFactory failed.

After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
  @Test
  public void testLCTokenizerFactoryNormalize() throws Exception {

Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new 
LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();

//fails
assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

//now try an integration test with the classic query parser
QueryParser p = new QueryParser("f", analyzer);
Query q = p.parse("Hello");
//passes
assertEquals(new TermQuery(new Term("f", "hello")), q);

q = p.parse("Hello*");
//fails
assertEquals(new PrefixQuery(new Term("f", "hello")), q);

q = p.parse("Hel*o");
//fails
assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
  }
{noformat}

The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
does not call the tokenizer, which, in the case of the LowerCaseAnalyzer, does 
the filtering work.


> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> --
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
>
> While working on SOLR-12034, a unit test that relied on the 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
> Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new 
> LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();
> //fails
> assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
> 
> //now try an integration test with the classic query parser
> QueryParser p = new QueryParser("f", analyzer);
> Query q = p.parse("Hello");
> //passes
> assertEquals(new TermQuery(new Term("f", "hello")), q);
> q = p.parse("Hello*");
> //fails
> assertEquals(new PrefixQuery(new Term("f", "hello")), q);
> q = p.parse("Hel*o");
> //fails
> assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
> does not call the tokenizer, which, in the case of the LowerCaseTokenizer, 
> does the filtering work.






[jira] [Created] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms

2018-02-26 Thread Tim Allison (JIRA)
Tim Allison created LUCENE-8186:
---

 Summary: CustomAnalyzer with a LowerCaseTokenizerFactory fails to 
normalize multiterms 
 Key: LUCENE-8186
 URL: https://issues.apache.org/jira/browse/LUCENE-8186
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Tim Allison


While working on SOLR-12034, a unit test that relied on the 
LowerCaseTokenizerFactory failed.

After some digging, I was able to replicate this at the Lucene level.

Unit test:
{noformat}
  @Test
  public void testLCTokenizerFactoryNormalize() throws Exception {

Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new 
LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build();

//fails
assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));

//now try an integration test with the classic query parser
QueryParser p = new QueryParser("f", analyzer);
Query q = p.parse("Hello");
//passes
assertEquals(new TermQuery(new Term("f", "hello")), q);

q = p.parse("Hello*");
//fails
assertEquals(new PrefixQuery(new Term("f", "hello")), q);

q = p.parse("Hel*o");
//fails
assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
  }
{noformat}

The problem is that the CustomAnalyzer iterates through the tokenfilters, but 
does not call the tokenizer, which, in the case of the LowerCaseAnalyzer, does 
the filtering work.






[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-02-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376975#comment-16376975
 ] 

Tim Allison commented on SOLR-11976:


Y, I started on it...uncovered at least one other bug...it is a pretty big 
undertaking.  Opened SOLR-12034.

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> ---
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Assignee: David Smiley
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}
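The overwrite-vs-chain distinction in the quoted fix is easy to see with plain string transforms standing in for `TokenFilterFactory`: applying each filter to the original input rather than to the running result means only the last multiterm-aware filter's output survives. A minimal sketch, not the actual Lucene classes:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class ChainDemo {
    // Buggy version: each filter is applied to the ORIGINAL input `in`,
    // so every iteration discards the previous filter's work.
    static String normalizeBuggy(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(in);      // bug: uses `in`, not `result`
        }
        return result;
    }

    // Fixed version: each filter chains on the running result.
    static String normalizeFixed(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(result);  // fix: one-token change, as in the patch
        }
        return result;
    }

    public static void main(String[] args) {
        // Two filters: lowercase, then fold 'l' to '1'.
        List<UnaryOperator<String>> filters =
                List.of(s -> s.toLowerCase(), s -> s.replace("l", "1"));
        System.out.println(normalizeBuggy("HELLO", filters)); // HELLO -- both effects lost
        System.out.println(normalizeFixed("HELLO", filters)); // he11o
    }
}
```

In the buggy loop the second filter sees `"HELLO"` (no lowercase `l` to replace), so the final result is the unmodified input; with the fix, both filters apply in sequence.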






[jira] [Created] (SOLR-12034) Replace TokenizerChain in Solr with Lucene's CustomAnalyzer

2018-02-26 Thread Tim Allison (JIRA)
Tim Allison created SOLR-12034:
--

 Summary: Replace TokenizerChain in Solr with Lucene's 
CustomAnalyzer
 Key: SOLR-12034
 URL: https://issues.apache.org/jira/browse/SOLR-12034
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison


Solr's TokenizerChain was created before Lucene's CustomAnalyzer was added, and 
it duplicates much of CustomAnalyzer.  Let's consider refactoring to remove 
TokenizerChain.






[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-02-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374303#comment-16374303
 ] 

Tim Allison commented on SOLR-11976:


Thank you.  I'll update the PR.  Should we also get rid of the special handling 
of multiterm analysis in TextField?  Or, separate issue?

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> ---
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-02-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372022#comment-16372022
 ] 

Tim Allison commented on SOLR-11976:


Ping...any committer interested in this or a larger PR to swap out 
{{TokenizerChain}} for {{CustomAnalyzer}}?

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> ---
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-02-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364010#comment-16364010
 ] 

Tim Allison commented on SOLR-11976:


Better yet, swap out Solr's {{TokenizerChain}} for Lucene's {{CustomAnalyzer}} 
and deprecate {{TokenizerChain}} in 7.x?

Happy to submit PR if a committer is willing to work with me on this.

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> ---
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-02-12 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11976:
---
Description: 
TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  

This doesn't currently break search because {{normalize}} is not being used at 
the Solr level (AFAICT); rather, TextField has its own {{analyzeMultiTerm()}} 
that duplicates code from the newer {{normalize}}. 

Code as is:
{noformat}
TokenStream result = in;
for (TokenFilterFactory filter : filters) {
  if (filter instanceof MultiTermAwareComponent) {
    filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
    result = filter.create(in);
  }
}
{noformat}

The fix is simple:

{noformat}
-result = filter.create(in);
+result = filter.create(result);
{noformat}


  was:
TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  

This doesn't currently break search because {{normalize}} is not currently 
being used at the Solr level (AFAICT); rather, TextField has its own 
{{analyzeMultiTerm()}} that duplicates codes from the newer {{normalize}}. 

Code as is:
{noformat}
TokenStream result = in;
for (TokenFilterFactory filter : filters) {
  if (filter instanceof MultiTermAwareComponent) {
    filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
    result = filter.create(in);
  }
}
{noformat}

The fix is simple:

{noformat}
-result = filter.create(in);
+result = filter.create(result);
{noformat}



> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> ---
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not being used 
> at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()

2018-02-12 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11976:
---
Summary: TokenizerChain is overwriting, not chaining TokenFilters in 
normalize()  (was: TokenizerChain is overwriting, not chaining in normalize())

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> ---
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not currently 
> being used at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()

2018-02-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361452#comment-16361452
 ] 

Tim Allison commented on SOLR-11976:


I'm happy to open a separate issue/PR to factor out {{TextField}}'s 
{{analyzeMultiTerm}} in favor of {{Analyzer#normalize()}}.
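The practical effect of the overwrite described in this issue is easy to see in miniature. The sketch below is not Lucene code: it uses plain {{UnaryOperator<String>}} stand-ins for the token filter factories, and only the loop shape mirrors {{result = filter.create(in)}} versus {{result = filter.create(result)}}.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class NormalizeChainDemo {

    // Buggy shape: each filter reads the ORIGINAL input, so earlier filters'
    // output is discarded and only the last filter's effect survives.
    static String buggy(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(in);       // mirrors: result = filter.create(in)
        }
        return result;
    }

    // Fixed shape: each filter reads the previous filter's output.
    static String fixed(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(result);   // mirrors: result = filter.create(result)
        }
        return result;
    }

    public static void main(String[] args) {
        // Two stand-in "multi-term-aware" filters: lowercase, then fold 'é' to 'e'.
        List<UnaryOperator<String>> filters = List.of(
                s -> s.toLowerCase(Locale.ROOT),
                s -> s.replace("é", "e"));

        System.out.println(buggy("Résumé", filters)); // only accent folding applied
        System.out.println(fixed("Résumé", filters)); // both filters applied
    }
}
```

With the buggy loop the lowercase step is lost entirely, which is why a multi-term query would only ever see the last filter in the chain.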

> TokenizerChain is overwriting, not chaining in normalize()
> --
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not currently 
> being used at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()

2018-02-12 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11976:
---
Priority: Minor  (was: Major)

> TokenizerChain is overwriting, not chaining in normalize()
> --
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not currently 
> being used at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()

2018-02-12 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11976:
---
Affects Version/s: (was: 7.2)
   master (8.0)

> TokenizerChain is overwriting, not chaining in normalize()
> --
>
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Major
>
> TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  
> This doesn't currently break search because {{normalize}} is not currently 
> being used at the Solr level (AFAICT); rather, TextField has its own 
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}






[jira] [Created] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()

2018-02-12 Thread Tim Allison (JIRA)
Tim Allison created SOLR-11976:
--

 Summary: TokenizerChain is overwriting, not chaining in normalize()
 Key: SOLR-11976
 URL: https://issues.apache.org/jira/browse/SOLR-11976
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: search
Affects Versions: 7.2
Reporter: Tim Allison


TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}.  

This doesn't currently break search because {{normalize}} is not currently 
being used at the Solr level (AFAICT); rather, TextField has its own 
{{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. 

Code as is:
{noformat}
TokenStream result = in;
for (TokenFilterFactory filter : filters) {
  if (filter instanceof MultiTermAwareComponent) {
    filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
    result = filter.create(in);
  }
}
{noformat}

The fix is simple:

{noformat}
-result = filter.create(in);
+result = filter.create(result);
{noformat}







[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2018-01-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310271#comment-16310271
 ] 

Tim Allison commented on SOLR-11701:


Finally back to keyboard. Doh, and thank you!!!

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
> Fix For: 7.3
>
> Attachments: SOLR-11701.patch, SOLR-11701.patch
>
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295483#comment-16295483
 ] 

Tim Allison commented on SOLR-11701:


Sounds good.  _Thank you_!

On the git conflict, y, that was caused by the recent addition of opennlp.  
I've updated the PR, but there are, of course, already new conflicts! :)  Let 
me know if I can do anything to help with that. 

On the 401, I'm not sure why that was happening...I'll take a look.

On the unused imports, ugh.  Thank you.

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
> Attachments: SOLR-11701.patch
>
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295411#comment-16295411
 ] 

Tim Allison commented on SOLR-11701:


Back to keyboard.  You're right in all of the above. When we bump slf4j from 
1.7.7 to 1.7.24, its behavior changes to print out the full stacktrace instead 
of just the message.

In org.slf4j.helpers.MessageFormatter in 1.7.7, the exception is counted as one 
of the members of {{argArray}}, and because of the following snippet, the 
{{throwableCandidate}} is nulled out in the returned {{FormattingTuple}}
{noformat}
if (L < argArray.length - 1) {
    return new FormattingTuple(sbuf.toString(), argArray, throwableCandidate);
} else {
    return new FormattingTuple(sbuf.toString(), argArray, (Throwable) null);
}
{noformat}

In 1.7.24, there's an added bit of logic before we get to that location that 
removes the exception from {{argArray}} so that it can't get swept into the 
message.
{noformat}
Object[] args = argArray;
if (throwableCandidate != null) {
    args = trimmedCopy(argArray);
}
{noformat}

I have in the back of my mind that there was a reason we upgraded slf4j in 
Tika.  I'll look through our git history to see when/why and if we need to do 
it for the Solr integration.
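Independent of slf4j's actual source, the behavior change can be modeled in a few lines. The sketch below is a simplified stand-in for {{MessageFormatter}}, not real slf4j code; the method names {{format177}} and {{format1724}} are invented for the demo. It shows why a trailing exception can be swept into the message text (and lose its stack trace) under the old logic, but always survives as the attached throwable under the new logic.

```java
import java.util.Arrays;

public class FormatterSketch {

    // A trailing Throwable in the arg array is the candidate to attach to the
    // log event (i.e., the thing whose stack trace gets printed).
    static Throwable lastThrowable(Object[] args) {
        return (args.length > 0 && args[args.length - 1] instanceof Throwable)
                ? (Throwable) args[args.length - 1] : null;
    }

    // 1.7.7-style: the throwable stays in the arg array, so if the pattern has
    // enough '{}' placeholders it is formatted into the message text and the
    // attached throwable is nulled out.
    static String[] format177(String pattern, Object[] args) {
        StringBuilder sb = new StringBuilder();
        int used = 0, from = 0, idx;
        while ((idx = pattern.indexOf("{}", from)) >= 0 && used < args.length) {
            sb.append(pattern, from, idx).append(args[used++]);
            from = idx + 2;
        }
        sb.append(pattern.substring(from));
        // Throwable survives only if the placeholders did NOT consume it.
        Throwable attached = (used < args.length) ? lastThrowable(args) : null;
        return new String[]{sb.toString(), attached == null ? "no stack trace" : "stack trace"};
    }

    // 1.7.24-style: trim the trailing throwable out of the arg array before
    // formatting, so a placeholder can never consume it.
    static String[] format1724(String pattern, Object[] args) {
        Throwable candidate = lastThrowable(args);
        Object[] trimmed = (candidate != null) ? Arrays.copyOf(args, args.length - 1) : args;
        String msg = format177(pattern, trimmed)[0];
        return new String[]{msg, candidate == null ? "no stack trace" : "stack trace"};
    }

    public static void main(String[] args) {
        Object[] a = {"doc1", new RuntimeException("boom")};
        // Two placeholders: old logic swallows the exception into the message.
        System.out.println(Arrays.toString(format177("failed on {} {}", a)));
        // New logic keeps the exception attached regardless of placeholders.
        System.out.println(Arrays.toString(format1724("failed on {} {}", a)));
    }
}
```

This matches the symptom above: with 1.7.24 the full stack trace prints instead of just the message.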


> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
> Attachments: SOLR-11701.patch
>
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-17 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294225#comment-16294225
 ] 

Tim Allison commented on SOLR-11701:


Ugh. I’m still without keyboard. Can you tell which dependency is now adding 
more stuff? Will take a look tomorrow. Thank you for making it easy for me to 
replicate.

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
> Attachments: SOLR-11701.patch
>
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293930#comment-16293930
 ] 

Tim Allison commented on SOLR-11701:


Away from tools now. Will look on Monday. Thank you!

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
> Attachments: SOLR-11701.patch
>
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293525#comment-16293525
 ] 

Tim Allison commented on SOLR-11701:


Y. Thank you!

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293243#comment-16293243
 ] 

Tim Allison commented on SOLR-11701:


K.  I turned off the warnings with 
[d25349d|https://github.com/apache/lucene-solr/pull/291/commits/d25349dba44f8774683863092104fad8ea05c75d],
 and I reran the integration tests. That _should_ be ready to go.

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292681#comment-16292681
 ] 

Tim Allison commented on SOLR-11701:


One more change... I'd like to turn off the missing jar warnings as the default 
in Solr.  Update to PR coming soon, unless that should be a different issue.

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291955#comment-16291955
 ] 

Tim Allison commented on SOLR-11701:


Yes, and please.  Thank you!

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Erick Erickson
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291937#comment-16291937
 ] 

Tim Allison commented on SOLR-11622:


Turns out I did because you had done most of the work! :)

See https://github.com/apache/lucene-solr/pull/291 over on SOLR-11701.

> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch, SOLR-11622.patch
>
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases 
> does not match what is required by Apache Tika for parsing rfc2822 messages. 
> The master branch for james-mime4j seems to contain the missing Builder class
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java
> ]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'
> }}
> And results in the following stacktrace:
> java.lang.NoClassDefFoundError: 
> org/apache/james/mime4j/stream/MimeConfig$Builder
> at 
> org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> 

[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291935#comment-16291935
 ] 

Tim Allison commented on SOLR-11701:


I merged [~kramachand...@commvault.com]'s mods and made a few updates for Tika 
1.17.

I ran an integration test against 643 files in Apache Tika's unit test docs, 
and I got the same # of documents indexed in Solr as tika-app.jar parsed 
without exceptions.

{noformat}
public static void main(String[] args) throws Exception {
    Path extracts = Paths.get("C:\\data\\tika_unit_tests_extracts");
    SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/fileupload_passt/").build();
    for (File f : extracts.toFile().listFiles()) {
        try (Reader r = Files.newBufferedReader(f.toPath(), StandardCharsets.UTF_8)) {
            List<Metadata> metadataList = JsonMetadataList.fromJson(r);
            String ex = metadataList.get(0).get(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime");
            if (ex == null) {
                SolrQuery q = new SolrQuery("id: " + f.getName().replace(".json", ""));
                QueryResponse response = client.query(q);
                SolrDocumentList results = response.getResults();
                if (results.getNumFound() != 1) {
                    System.err.println(f.getName() + " " + results.getNumFound());
                }
            }
        }
    }
}
{noformat}

I did the usual dance:
{noformat}
ant clean-jars jar-checksums
ant precommit
{noformat}

[~erickerickson], this _should_ be good to go.  


> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291483#comment-16291483
 ] 

Tim Allison commented on SOLR-11622:


[~kramachand...@commvault.com], if it is ok with you and if I have time, I'll 
try to submit a PR on SOLR-11701.  If I don't have time, it will be all yours 
after you return. :)  Sound good...or do you want the glory?

For the last integration test I did, I put [these 
documents|https://github.com/apache/tika/tree/master/tika-parsers/src/test/resources/test-documents]
 in a directory and ran tika-app.jar against them.  I then ran tika-eval.jar 
and counted the number of files without exceptions to get a ground truth count 
of how many files I'd expect to be in Solr.

I then used DIH to import the same directory, with skip on error, and made sure 
there were the same # of documents in Solr.  This uncovered several problems, 
which we'll fix in this issue or SOLR-11701.  

> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch, SOLR-11622.patch
>
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases 
> does not match what is required by Apache Tika for parsing rfc2822 messages. 
> The master branch for james-mime4j seems to contain the missing Builder class
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java
> ]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'
> }}
> And results in the following stacktrace:
> java.lang.NoClassDefFoundError: 
> org/apache/james/mime4j/stream/MimeConfig$Builder
> at 
> org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at 

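The quoted failure above comes down to a mime4j jar on the classpath that predates what Tika's {{RFC822Parser}} requires. One way to confirm which situation a given Solr install is in is to probe for the missing class at runtime. This is a minimal sketch, not part of the patch: the class name comes from the stack trace above, and the class and method names in the sketch itself are illustrative.

```java
// Probe whether the mime4j on the classpath provides the Builder
// nested class that Tika's RFC822Parser needs (per the stack trace).
public class Mime4jCheck {
    static final String REQUIRED = "org.apache.james.mime4j.stream.MimeConfig$Builder";

    static boolean mime4jIsNewEnough() {
        try {
            Class.forName(REQUIRED);
            return true;   // class resolves: bundled mime4j is sufficient
        } catch (ClassNotFoundException e) {
            return false;  // old (or missing) mime4j: Tika will fail as above
        }
    }

    public static void main(String[] args) {
        System.out.println(mime4jIsNewEnough()
                ? "mime4j provides MimeConfig$Builder"
                : "mime4j is missing MimeConfig$Builder -- upgrade the jar");
    }
}
```

Run with the Solr extraction contrib's lib directory on the classpath; if it prints the "missing" message, posting `message/rfc822` content will hit the `NoClassDefFoundError` shown above.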
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291138#comment-16291138
 ] 

Tim Allison commented on SOLR-11622:


Sorry, right, yes, please and thank you.  The question is whether Karthik wants 
to put together a comprehensive Tika 1.17 upgrade PR or whether I should... either 
way, with you, [~erickerickson], as the reviewer+committer.


[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291128#comment-16291128
 ] 

Tim Allison commented on SOLR-11622:


Thank you [~erickerickson]!

Y, SOLR-11701 with [~kramachand...@commvault.com]'s fixes here could be unified 
into one PR that would upgrade us to Tika 1.17 and would fix numerous 
dependency problems that I found when I finally did an integration test with 
Tika's test files 
[above|https://issues.apache.org/jira/browse/SOLR-11622?focusedCommentId=16277347&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16277347].

This single PR would close out this issue, SOLR-11693, and SOLR-11701 _and_ 
clean up problems I haven't even opened issues for (msaccess, and ...)

[~kramachand...@commvault.com], would you like to have a go at SOLR-11701, 
plagiarizing my notes, or should I plagiarize your work for SOLR-11701?


[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291082#comment-16291082
 ] 

Tim Allison commented on SOLR-11622:


I'm not a committer on Lucene/Solr so I can't help.  Sorry.  

Now that Tika 1.17 is out, it would be great to get that fully integrated, to 
include your fixes (SOLR-11701)...especially because this would fix a nasty 
regression that prevents pptx files with tables from getting indexed 
(SOLR-11693).

[~shalinmangar] or [~thetaphi], if [~kramachand...@commvault.com] or I put 
together a PR for SOLR-11701, would you be willing to review and commit?

This time, I'll run DIH against Tika's unit test documents before making the 
PR... 


[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277359#comment-16277359
 ] 

Tim Allison commented on SOLR-11622:


My {{ant precommit}} run had the usual build failure with broken links... So, I 
think we're good. :)


[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277347#comment-16277347
 ] 

Tim Allison commented on SOLR-11622:


Smh...that we haven't run Solr against Tika's test files before/recently.  This 
would have surfaced SOLR-11693.  Unit tests would not have found that, but a 
full integration test would have. :(

Speaking of which, with ref to 
[this|https://issues.apache.org/jira/browse/SOLR-11622?focusedCommentId=16274648&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16274648],
 I'm still getting the CTTable xsb error on our {{testPPT_various.pptx}}, and 
you can't just do a drop-and-replace of POI-3.17-beta1 with POI-3.17, because 
there's a binary conflict on wmf files.  That fix will require the upgrade to 
Tika 1.17, which should be on the way.  I'm guessing that you aren't seeing 
that because of the luck of your classloader?


[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277232#comment-16277232
 ] 

Tim Allison commented on SOLR-11622:


Finished analysis.  Will submit PR to against your branch shortly.  Working on 
{{ant precommit}} now.


[jira] [Comment Edited] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277232#comment-16277232
 ] 

Tim Allison edited comment on SOLR-11622 at 12/4/17 6:43 PM:
-

Finished analysis.  Will submit PR against your branch shortly.  Working on 
{{ant precommit}} now.


was (Author: talli...@mitre.org):
Finished analysis.  Will submit PR to against your branch shortly.  Working on 
{{ant precommit}} now.


[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276976#comment-16276976
 ] 

Tim Allison commented on SOLR-11622:


Will do.  I'm finding some other things that need to be fixed as well.  I have 
no idea why neither I nor anyone else (apparently?) has run DIH on Tika's test 
files (at least recently?!)...  We've got to change this in our processes.

> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch, SOLR-11622.patch
>
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases 
> does not match what is required by Apache Tika for parsing rfc2822 messages. 
> The master branch for james-mime4j seems to contain the missing Builder class
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'}}

[jira] [Updated] (SOLR-11721) Isolate most of Tika and dependencies into separate jvm

2017-12-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11721:
---
Summary: Isolate most of Tika and dependencies into separate jvm  (was: 
Isolate Tika and dependencies into separate jvm)

> Isolate most of Tika and dependencies into separate jvm
> ---
>
> Key: SOLR-11721
> URL: https://issues.apache.org/jira/browse/SOLR-11721
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>
> Tika should not be run in the same jvm as Solr.  Ever.  
> Upgrading Tika and hoping to avoid jar hell, while getting all of the 
> dependencies right manually is, um, error prone.  See my recent failure: 
> SOLR-11622, for which I apologize profusely.
> Running DIH against Tika's unit test documents has been eye-opening. It has 
> revealed some other version conflict/dependency failures that should have 
> been caught much earlier.
> The fix is non-trivial, but we should work towards it.
> I see two options:
> 1. TIKA-2514 -- Our current ForkParser offers a model for a minimal fork 
> process + server option.  The limitation currently is that all parsers and 
> dependencies must be serializable, which can be a problem for users adding 
> their own parsers with deps that might not be designed for serializability.  
> The proposal there is to rework the ForkParser to use a TIKA_HOME directory 
> for all dependencies.
> 2. SOLR-7632 -- use tika-server, but make it seamless and as easy (and 
> secure!) to use as the current handlers.
> Other thoughts, recommendations?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2017-12-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276888#comment-16276888
 ] 

Tim Allison commented on SOLR-7632:
---

bq. To carry out Erik Hatcher's recommendation...I don't know if we'd need CORS 
for this or not, but it might be neat to modify Tika's server to allow users to 
inject their own resources=endpoints via a config file and an extra jar. Within 
the Solr project, we'd just have to implement a resource that takes an input 
stream, runs Tika and then adds a SolrInputDocument.

[~gostep] has proposed allowing users to configure a custom ContentHandler in 
tika-server.  This could enable Solr to create its own content handler that 
tika-server could use to send the extracted text to Solr on endDocument().
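That proposal can be sketched with a plain SAX handler from the JDK. The class name and callback wiring below are illustrative only, not tika-server's actual API: text is buffered as it is extracted, and a single callback fires when endDocument() arrives, which is where a Solr-side implementation would build and submit its SolrInputDocument.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers extracted text and hands it off once endDocument() fires. */
class OnEndDocumentHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Consumer<String> onComplete; // stand-in for "index into Solr"

    OnEndDocumentHandler(Consumer<String> onComplete) {
        this.onComplete = onComplete;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // accumulate extracted text as it streams in
    }

    @Override
    public void endDocument() throws SAXException {
        onComplete.accept(text.toString()); // fire exactly once, when parsing ends
    }
}

public class HandlerDemo {
    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader("<doc>hello solr</doc>")),
                new OnEndDocumentHandler(t -> System.out.println("indexing: " + t)));
        // prints: indexing: hello solr
    }
}
```

The same shape works for Tika's handlers, since Tika parsers emit SAX events; the only Solr-specific part is what the callback does with the finished text.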

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>  Labels: gsoc2017, memex
>
> It's a pain to upgrade Tika's jars all the times when we release, and if Tika 
> fails it messes up the ExtractingRequestHandler (e.g., the document type 
> caused Tika to fail, etc). A more reliable way and also separated, and easier 
> to deploy version of the ExtractingRequestHandler would make a network call 
> to the Tika JAXRS server, and then call Tika on the Solr server side, get the 
> results and then index the information that way. I have a patch in the works 
> from the DARPA Memex project and I hope to post it soon.






[jira] [Created] (SOLR-11721) Isolate Tika and dependencies into separate jvm

2017-12-04 Thread Tim Allison (JIRA)
Tim Allison created SOLR-11721:
--

 Summary: Isolate Tika and dependencies into separate jvm
 Key: SOLR-11721
 URL: https://issues.apache.org/jira/browse/SOLR-11721
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison


Tika should not be run in the same jvm as Solr.  Ever.  

Upgrading Tika and hoping to avoid jar hell, while getting all of the 
dependencies right manually is, um, error prone.  See my recent failure: 
SOLR-11622, for which I apologize profusely.

Running DIH against Tika's unit test documents has been eye-opening. It has 
revealed some other version conflict/dependency failures that should have been 
caught much earlier.

The fix is non-trivial, but we should work towards it.
I see two options:

1. TIKA-2514 -- Our current ForkParser offers a model for a minimal fork 
process + server option.  The limitation currently is that all parsers and 
dependencies must be serializable, which can be a problem for users adding 
their own parsers with deps that might not be designed for serializability.  
The proposal there is to rework the ForkParser to use a TIKA_HOME directory for 
all dependencies.

2. SOLR-7632 -- use tika-server, but make it seamless and as easy (and secure!) 
to use as the current handlers.

Other thoughts, recommendations?
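Option 1 boils down to running the parse in a child JVM whose classpath is independent of Solr's. A stdlib-only sketch of that pattern follows; the class and method names are illustrative, not Tika's actual ForkParser API, and the "--child" branch is a stand-in for real parsing work:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ForkJvmSketch {

    /** Launch a child JVM with its own classpath and wait for it to exit. */
    static int runInChildJvm(String childClasspath, String mainClass, List<String> args)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home") + "/bin/java");
        cmd.add("-cp");
        cmd.add(childClasspath); // e.g. a directory holding tika-app.jar
        cmd.add(mainClass);
        cmd.addAll(args);
        // A crash, OOM, or infinite loop in the child never takes down the parent;
        // the parent can also time out on waitFor() and kill/restart the child.
        Process child = new ProcessBuilder(cmd).inheritIO().start();
        return child.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && "--child".equals(args[0])) {
            // Stand-in for the real work: the child would load the parser jars
            // from its own directory and stream parse results back to the parent.
            System.out.println("parsing in isolated jvm");
            return;
        }
        int exit = runInChildJvm(System.getProperty("java.class.path"),
                                 ForkJvmSketch.class.getName(),
                                 List.of("--child"));
        System.out.println("child exited with " + exit);
    }
}
```

In the actual proposal the child would resolve all parser dependencies from a TIKA_HOME-style directory and return results over a pipe or socket, so none of those jars ever land on Solr's classpath.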






[jira] [Comment Edited] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274994#comment-16274994
 ] 

Tim Allison edited comment on SOLR-11622 at 12/1/17 9:17 PM:
-

There's still a clash with jdom triggered by rss files and rometools

{noformat}
Exception in thread "Thread-21" java.lang.NoClassDefFoundError: 
org/jdom2/input/JDOMParseException
at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:63)
at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:51)
{noformat}

I'm confirming that should be bumped to 2.0.4.




was (Author: talli...@mitre.org):
There's still a clash with jdom triggered by rss files and rometools

{noformat}
Exception in thread "Thread-21" java.lang.NoClassDefFoundError: 
org/jdom2/input/JDOMParseException
at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:63)
at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:51)
{noformat}

I'm confirming that should be bumped to 2.0.4.



> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch
>
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases 
> does not match what is required by Apache Tika for parsing rfc2822 messages. 
> The master branch for james-mime4j seems to contain the missing Builder class
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'}}

[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274994#comment-16274994
 ] 

Tim Allison commented on SOLR-11622:


There's still a clash with jdom triggered by rss files and rometools

{noformat}
Exception in thread "Thread-21" java.lang.NoClassDefFoundError: 
org/jdom2/input/JDOMParseException
at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:63)
at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:51)
{noformat}

I'm confirming that should be bumped to 2.0.4.
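One stdlib-only way to surface this kind of clash up front, rather than mid-parse, is to probe the classpath for the classes the parsers will need. The probe below is a sketch, not anything Solr or Tika ships; the class names are the ones from the NoClassDefFoundErrors in this thread:

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathProbe {

    /** Return the subset of the given class names that cannot be resolved. */
    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: just resolve the class, don't run static initializers
                Class.forName(name, false, ClasspathProbe.class.getClassLoader());
            } catch (ClassNotFoundException | NoClassDefFoundError e) {
                missing.add(name); // the same signal behind the stack traces above
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Class names taken from the errors reported in this thread.
        System.out.println(missingClasses(List.of(
                "org.jdom2.input.JDOMParseException",
                "org.apache.james.mime4j.stream.MimeConfig$Builder")));
    }
}
```

Running a probe like this at startup (or in a smoke test over Tika's corpus, as suggested above) turns a runtime parse failure into an immediate, readable report of which jars are missing or mismatched.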



> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch
>
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases 
> does not match what is required by Apache Tika for parsing rfc2822 messages. 
> The master branch for james-mime4j seems to contain the missing Builder class
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'}}

[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-12-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274578#comment-16274578
 ] 

Tim Allison commented on SOLR-11622:


Taking a look now.  I want to run all of Tika's unit test docs through it to 
make sure I didn't botch anything else...

You saw the POI bug in SOLR-11693?

> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch
>
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases 
> does not match what is required by Apache Tika for parsing rfc2822 messages. 
> The master branch for james-mime4j seems to contain the missing Builder class
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'}}

[jira] [Updated] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-11-29 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11701:
---
Description: Kicking off release process for Tika 1.17 in the next few 
days.  Please let us know if you have any requests.

> Upgrade to Tika 1.17 when available
> ---
>
> Key: SOLR-11701
> URL: https://issues.apache.org/jira/browse/SOLR-11701
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>
> Kicking off release process for Tika 1.17 in the next few days.  Please let 
> us know if you have any requests.






[jira] [Updated] (SOLR-11693) Class loading problem for Tika/POI for some PPTX

2017-11-29 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11693:
---
Description: 
[~advokat] reported TIKA-2497.  I can reproduce this issue with a Solr instance 
in both 6.6.2 and 7.1.0.

I can't reproduce it when I run the triggering file within Solr's unit tests or 
with straight Tika. 

Would anyone with more knowledge of classloading within Solr be able to help?

See TIKA-2497 for triggering file and conf files.

...turns out this is a bug in POI 3.16 and 3.17-beta1

  was:
[~advokat] reported TIKA-2497.  I can reproduce this issue with a Solr instance 
in both 6.6.2 and 7.1.0.

I can't reproduce it when I run the triggering file within Solr's unit tests or 
with straight Tika.  I can see CTTable as a class where it belongs in 
contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar.

Would anyone with more knowledge of classloading within Solr be able to help?

See TIKA-2497 for triggering file and conf files.

Stacktrace:
{noformat}

<int name="status">500</int><int name="QTime">204</int>
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">java.lang.IllegalStateException</str>
<str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62</str>
<str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
... 34 more
Caused by: java.lang.IllegalStateException: Schemas (*.xsb) for CTTable can't 
be loaded - usually this 

[jira] [Commented] (SOLR-11693) Class loading problem for Tika/POI for some PPTX

2017-11-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270677#comment-16270677
 ] 

Tim Allison commented on SOLR-11693:


[~yegor.kozlov] noted on the POI dev list that this is now fixed in POI 3.17.

> Class loading problem for Tika/POI for some PPTX
> 
>
> Key: SOLR-11693
> URL: https://issues.apache.org/jira/browse/SOLR-11693
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Affects Versions: 7.1
>Reporter: Tim Allison
>Priority: Minor
>
> [~advokat] reported TIKA-2497.  I can reproduce this issue with a Solr 
> instance in both 6.6.2 and 7.1.0.
> I can't reproduce it when I run the triggering file within Solr's unit tests 
> or with straight Tika.  I can see CTTable as a class where it belongs in 
> contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar.
> Would anyone with more knowledge of classloading within Solr be able to help?
> See TIKA-2497 for triggering file and conf files.
> Stacktrace:
> {noformat}
> 
> status: 500
> QTime: 204
> error-class: org.apache.solr.common.SolrException
> root-error-class: java.lang.IllegalStateException
> msg: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> trace: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> at 

[jira] [Created] (SOLR-11701) Upgrade to Tika 1.17 when available

2017-11-29 Thread Tim Allison (JIRA)
Tim Allison created SOLR-11701:
--

 Summary: Upgrade to Tika 1.17 when available
 Key: SOLR-11701
 URL: https://issues.apache.org/jira/browse/SOLR-11701
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement

2017-11-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269524#comment-16269524
 ] 

Tim Allison commented on SOLR-11622:


Y.  This was my mistake/omission in SOLR-10335.  Ugh.

> Bundled mime4j library not sufficient for Tika requirement
> --
>
> Key: SOLR-11622
> URL: https://issues.apache.org/jira/browse/SOLR-11622
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 7.1, 6.6.2
>Reporter: Karim Malhas
>Assignee: Karthik Ramachandran
>Priority: Minor
>  Labels: build
> Attachments: SOLR-11622.patch
>
>
> Version 0.7.2 of Apache James Mime4j, bundled with the Solr binary releases, 
> does not match what Apache Tika requires for parsing RFC 2822 messages. 
> The master branch of james-mime4j contains the missing MimeConfig.Builder class:
> [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java]
> This prevents import of rfc2822 formatted messages. For example like so:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'}}
> And results in the following stacktrace:
> java.lang.NoClassDefFoundError: 
> org/apache/james/mime4j/stream/MimeConfig$Builder
> at 
> org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> 

[jira] [Updated] (SOLR-11693) Class loading problem for Tika/POI for some PPTX

2017-11-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11693:
---
Affects Version/s: 7.1

> Class loading problem for Tika/POI for some PPTX
> 
>
> Key: SOLR-11693
> URL: https://issues.apache.org/jira/browse/SOLR-11693
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Affects Versions: 7.1
>Reporter: Tim Allison
>Priority: Minor
>
> [~advokat] reported TIKA-2497.  I can reproduce this issue with a Solr 
> instance in both 6.6.2 and 7.1.0.
> I can't reproduce it when I run the triggering file within Solr's unit tests 
> or with straight Tika.  I can see CTTable as a class where it belongs in 
> contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar.
> Would anyone with more knowledge of classloading within Solr be able to help?
> See TIKA-2497 for triggering file and conf files.
> Stacktrace:
> {noformat}
> 
> status: 500
> QTime: 204
> error-class: org.apache.solr.common.SolrException
> root-error-class: java.lang.IllegalStateException
> msg: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> trace: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 

[jira] [Updated] (SOLR-11693) Class loading problem for Tika/POI for some PPTX

2017-11-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11693:
---
Priority: Minor  (was: Major)

> Class loading problem for Tika/POI for some PPTX
> 
>
> Key: SOLR-11693
> URL: https://issues.apache.org/jira/browse/SOLR-11693
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Affects Versions: 7.1
>Reporter: Tim Allison
>Priority: Minor
>
> [~advokat] reported TIKA-2497.  I can reproduce this issue with a Solr 
> instance in both 6.6.2 and 7.1.0.
> I can't reproduce it when I run the triggering file within Solr's unit tests 
> or with straight Tika.  I can see CTTable as a class where it belongs in 
> contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar.
> Would anyone with more knowledge of classloading within Solr be able to help?
> See TIKA-2497 for triggering file and conf files.
> Stacktrace:
> {noformat}
> 
> status: 500
> QTime: 204
> error-class: org.apache.solr.common.SolrException
> root-error-class: java.lang.IllegalStateException
> msg: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> trace: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 

[jira] [Created] (SOLR-11693) Class loading problem for Tika/POI for some PPTX

2017-11-28 Thread Tim Allison (JIRA)
Tim Allison created SOLR-11693:
--

 Summary: Class loading problem for Tika/POI for some PPTX
 Key: SOLR-11693
 URL: https://issues.apache.org/jira/browse/SOLR-11693
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: contrib - DataImportHandler
Reporter: Tim Allison


[~advokat] reported TIKA-2497.  I can reproduce this issue with a Solr instance 
in both 6.6.2 and 7.1.0.

I can't reproduce it when I run the triggering file within Solr's unit tests or 
with straight Tika.  I can see CTTable as a class where it belongs in 
contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar.

Would anyone with more knowledge of classloading within Solr be able to help?

See TIKA-2497 for triggering file and conf files.
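For chasing this kind of classloading question, one quick check is to ask a given classloader where (if anywhere) it resolves a class from. The sketch below is a generic, stdlib-only diagnostic, not Solr code; the class and method names (WhichJar, locate) are made up, and in a real session one would pass the failing POI schema class name:

```java
public class WhichJar {
    // Report where the context classloader resolves a class from,
    // or "NOT FOUND" if it cannot be loaded at all.
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false,
                    Thread.currentThread().getContextClassLoader());
            java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK core classes have no CodeSource.
            return src == null ? "bootstrap/jdk" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "NOT FOUND";
        }
    }

    public static void main(String[] args) {
        // For SOLR-11693 one would check e.g.
        // "org.openxmlformats.schemas.spreadsheetml.x2006.main.CTTable"
        // from within the webapp and from within the extraction contrib.
        System.out.println(locate("java.lang.String"));
        System.out.println(locate("no.such.Clazz"));
    }
}
```

If the class resolves to a different jar than expected (or not at all) under the thread context classloader, that points at the same OSGI/context-classloader mismatch the IllegalStateException message describes.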

Stacktrace:
{noformat}

status: 500
QTime: 204
error-class: org.apache.solr.common.SolrException
root-error-class: java.lang.IllegalStateException
msg: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
trace: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
... 34 more
Caused by: java.lang.IllegalStateException: Schemas (*.xsb) for CTTable can't be loaded - usually this happens when OSGI loading is used and the thread context classloader has no reference to the xmlbeans classes - use POIXMLTypeLoader.setClassLoader() to set 

[jira] [Commented] (SOLR-8981) Upgrade to Tika 1.13 when it is available

2017-10-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212527#comment-16212527
 ] 

Tim Allison commented on SOLR-8981:
---

+1  Thank you, [~thetaphi]!

> Upgrade to Tika 1.13 when it is available
> -
>
> Key: SOLR-8981
> URL: https://issues.apache.org/jira/browse/SOLR-8981
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Tim Allison
>Assignee: Uwe Schindler
> Fix For: 5.5.5, 6.2, 7.0
>
>
> Tika 1.13 should be out within a month.  This includes PDFBox 2.0.0 and a 
> number of other upgrades and improvements.  
> If there are any showstoppers in 1.13 from Solr's side or requests before we 
> roll 1.13, let us know.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10335) Upgrade to Tika 1.16 when available

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204121#comment-16204121
 ] 

Tim Allison commented on SOLR-10335:


Thank you, again!

> Upgrade to Tika 1.16 when available
> ---
>
> Key: SOLR-10335
> URL: https://issues.apache.org/jira/browse/SOLR-10335
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Shalin Shekhar Mangar
>Priority: Critical
> Fix For: 7.1, 6.6.2
>
>
> Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of 
> Tika 1.15.
> Please let us know if you have any requests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203840#comment-16203840
 ] 

Tim Allison commented on SOLR-11450:


bq. I'm not familiar enough with Solr query parsers

Yes, I've been away from this for too long and got the first couple of answers to 
[~bjarkebm] wrong on the user list because of the differences between Lucene and 
Solr.  It is good to be back.

Thank you!

> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10335) Upgrade to Tika 1.16 when available

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203804#comment-16203804
 ] 

Tim Allison commented on SOLR-10335:


[~shalinmangar], should I submit another PR for the 6_x and 6.6.2 branch or 
will you take it from here?  THANK YOU!!!

> Upgrade to Tika 1.16 when available
> ---
>
> Key: SOLR-10335
> URL: https://issues.apache.org/jira/browse/SOLR-10335
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Shalin Shekhar Mangar
>Priority: Critical
>
> Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of 
> Tika 1.15.
> Please let us know if you have any requests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-10335) Upgrade to Tika 1.16 when available

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203804#comment-16203804
 ] 

Tim Allison edited comment on SOLR-10335 at 10/13/17 4:32 PM:
--

[~shalinmangar], should I submit another PR for the 6_x and 6.6.2 branches or 
will you take it from here?  THANK YOU!!!


was (Author: talli...@mitre.org):
[~shalinmangar], should I submit another PR for the 6_x and 6.6.2 branch or 
will you take it from here?  THANK YOU!!!

> Upgrade to Tika 1.16 when available
> ---
>
> Key: SOLR-10335
> URL: https://issues.apache.org/jira/browse/SOLR-10335
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Shalin Shekhar Mangar
>Priority: Critical
>
> Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of 
> Tika 1.15.
> Please let us know if you have any requests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203778#comment-16203778
 ] 

Tim Allison edited comment on SOLR-11450 at 10/13/17 4:30 PM:
--

Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultiTermAware and/or swapping in a KeywordAnalyzer 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] 
...almost like {{Analyzer.normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], 
which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multiterm analyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser but not fully for the 
CPQP in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.
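The subsetting step described above can be sketched as a toy model.  This is a 
simplified illustration, not Solr's actual classes: the Toy* factories, the 
interfaces, and {{multiTermChain}} are made-up stand-ins for the real filter 
factories and the {{FieldTypePluginLoader}} logic.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: keep only the chain components that are "multiterm aware",
// dropping everything else when analyzing wildcard/prefix query text.
interface TokenFilterFactory {
    String apply(String token);
}

interface MultiTermAwareComponent {
    TokenFilterFactory getMultiTermComponent();
}

// Lowercasing is safe for multiterm queries, so it advertises itself.
class ToyLowerCaseFilterFactory implements TokenFilterFactory, MultiTermAwareComponent {
    public String apply(String token) { return token.toLowerCase(); }
    public TokenFilterFactory getMultiTermComponent() { return this; }
}

// A stemmer is NOT multiterm aware; it must not run on wildcard prefixes.
class ToyStemFilterFactory implements TokenFilterFactory {
    public String apply(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }
}

public class MultiTermDemo {
    // Subset the chain to its multiterm-aware components.
    static List<TokenFilterFactory> multiTermChain(List<TokenFilterFactory> chain) {
        List<TokenFilterFactory> subset = new ArrayList<>();
        for (TokenFilterFactory f : chain) {
            if (f instanceof MultiTermAwareComponent) {
                subset.add(((MultiTermAwareComponent) f).getMultiTermComponent());
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        List<TokenFilterFactory> chain =
                List.of(new ToyLowerCaseFilterFactory(), new ToyStemFilterFactory());
        String prefix = "Cars";
        for (TokenFilterFactory f : multiTermChain(chain)) {
            prefix = f.apply(prefix);
        }
        // Lowercased but not stemmed.
        System.out.println(prefix);
    }
}
```

The bug report above amounts to: the classic query parser runs the query text 
through this subsetted chain (including charfilters) for all multiterm types, 
while the 6.x CPQP skips it for prefix, regex, and fuzzy.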



was (Author: talli...@mitre.org):
Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultitermAware and/or swapping in a KeywordAnalyzer 
--[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182]
 ...just like {{Analyzer.normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883],
 which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser but not fully for the 
CPQP in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.


> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Comment Edited] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203778#comment-16203778
 ] 

Tim Allison edited comment on SOLR-11450 at 10/13/17 4:20 PM:
--

Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultitermAware and/or swapping in a KeywordAnalyzer 
--[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182]
 ...just like {{Analyzer.normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883],
 which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser but not fully for the 
CPQP in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.



was (Author: talli...@mitre.org):
Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultitermAware and/or swapping in a KeywordAnalyzer 
--[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182]
 ...just like {{Analyzer.normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883],
 which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.


> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Comment Edited] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203778#comment-16203778
 ] 

Tim Allison edited comment on SOLR-11450 at 10/13/17 4:19 PM:
--

Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultitermAware and/or swapping in a KeywordAnalyzer 
--[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182]
 ...just like {{Analyzer.normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883],
 which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.



was (Author: talli...@mitre.org):
Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultitermAware and swapping in a KeywordTokenizer 
--[here|http://example.com] 
[https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182]
 ...just CustomAnalyzer's {{normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883],
 which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.


> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203778#comment-16203778
 ] 

Tim Allison commented on SOLR-11450:


Ha.  Right.  Solr does do its own thing.  {{FieldTypePluginLoader}} generates a 
multiterm analyzer in the TextField by subsetting the TokenizerChain's 
components that are MultitermAware and swapping in a KeywordTokenizer 
--[here|http://example.com] 
[https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182]
 ...just CustomAnalyzer's {{normalize()}} in 7.x :)

Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} 
[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883],
 which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s 
multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above.

So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and 
all is good.  However, the CPQP doesn't extend the SolrQueryParserBase.  

Two things make this feel like a bug and not a feature in Solr 6.x:

1) multiterm analysis works for the classic query parser in Solr 6.x
2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse 
wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy.


> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203573#comment-16203573
 ] 

Tim Allison commented on SOLR-11450:


[~jpountz], thank you for your response!

Yes, the changes in 7.x are fantastic.

Am I misunderstanding 6.x, though?  This test passes, which suggests that 
normalization was working correctly for the classic query parser in 6.x, but not 
for the CPQP.

If your point is that this would be a breaking change for some users of cpqp 
and it therefore doesn't belong in a bugfix release, I'm willing to accept that.

{noformat}
  @Test
  public void testCharFilter() {
    assertU(adoc("iso-latin1", "craezy traen", "id", "1"));
    assertU(commit());
    assertU(optimize());

    assertQ(req("q", "iso-latin1:cr\u00E6zy")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:tr\u00E6n")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:c\u00E6zy~1")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:cr\u00E6z*")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:*\u00E6zy")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:cr\u00E6*y")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:/cr\u00E6[a-z]y/")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );

    assertQ(req("q", "iso-latin1:[cr\u00E6zx TO cr\u00E6zz]")
        , "//result[@numFound='1']"
        , "//doc[./str[@name='id']='1']"
    );
  }
{noformat}

> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203541#comment-16203541
 ] 

Tim Allison commented on SOLR-11450:


[~mkhludnev] or any other committer willing to review and push for 6.6.2?  

> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Updated] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11450:
---
Labels: patch-with-test  (was: )

> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
>  Labels: patch-with-test
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Commented] (SOLR-10335) Upgrade to Tika 1.16 when available

2017-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203529#comment-16203529
 ] 

Tim Allison commented on SOLR-10335:


Thank you, [~shalinmangar]!  Is it worth backporting to 6.6.2?

> Upgrade to Tika 1.16 when available
> ---
>
> Key: SOLR-10335
> URL: https://issues.apache.org/jira/browse/SOLR-10335
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Tim Allison
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
>
> Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of 
> Tika 1.15.
> Please let us know if you have any requests.






[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202585#comment-16202585
 ] 

Tim Allison commented on SOLR-11450:


To get the directionality right...sorry.  The issue I opened is a duplicate of 
LUCENE-7687...my error.

> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x

2017-10-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202573#comment-16202573
 ] 

Tim Allison commented on SOLR-11450:


[~mikemccand] [~jpountz], any chance you'd be willing to review and push this 
into 6.6.2?

> ComplexPhraseQParserPlugin not running charfilter for some multiterm queries 
> in 6.x 
> 
>
> Key: SOLR-11450
> URL: https://issues.apache.org/jira/browse/SOLR-11450
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.6.1
>Reporter: Tim Allison
>Priority: Minor
> Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch
>
>
> On the user list, [~bjarkebm] reported that the charfilter is not being 
> applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x.  Bjarke 
> fixed my proposed unit tests to prove this failure. All appears to work in 
> 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.






[jira] [Commented] (LUCENE-7687) ComplexPhraseQueryParser with AsciiFoldingFilterFactory (SOLR)

2017-10-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202561#comment-16202561
 ] 

Tim Allison commented on LUCENE-7687:
-

There's a patch available on SOLR-11450.  This appears to have been fixed in 7.x.

> ComplexPhraseQueryParser with AsciiFoldingFilterFactory (SOLR)
> --
>
> Key: LUCENE-7687
> URL: https://issues.apache.org/jira/browse/LUCENE-7687
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.4.1
> Environment: solr-6.4.1 (yes, solr, but I don't know where the bug 
> exactly is)
>Reporter: Jochen Barth
>
> I modified generic *_txt-Field type to use AsciiFoldingFilterFactory on query 
> & index.
> When querying with
> \{!complexphrase}text_txt:"König*" -- there are 0 results
> \{!complexphrase}text_txt:"Konig*" -- there are >0 results
> \{!complexphrase}text_txt:"König" -- there are >0 results (but less than the 
> line above)
> and without \{!complexphrase} everything works o.k.






[jira] [Commented] (SOLR-11462) TokenizerChain's normalize() doesn't work

2017-10-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199244#comment-16199244
 ] 

Tim Allison commented on SOLR-11462:


Could also submit PR for getting rid of TokenizerChain in favor of 
CustomAnalyzer. :)

> TokenizerChain's normalize() doesn't work
> -
>
> Key: SOLR-11462
> URL: https://issues.apache.org/jira/browse/SOLR-11462
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Trivial
>
> TokenizerChain's {{normalize()}} is not currently used so this doesn't 
> currently have any negative effects on search.  However, there is a bug, and 
> we should fix it.
> If applied to a TokenizerChain with {{filters.length > 1}}, only the last 
> would apply. 
>  
> {noformat}
>  @Override
>   protected TokenStream normalize(String fieldName, TokenStream in) {
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> return result;
>   }
> {noformat}
> The fix is trivial:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}
> If you'd like to swap out {{TextField#analyzeMultiTerm()}} with, say:
> {noformat}
>   public static BytesRef analyzeMultiTerm(String field, String part, Analyzer 
> analyzerIn) {
> if (part == null || analyzerIn == null) return null;
> return analyzerIn.normalize(field, part);
>   }
> {noformat}
> I'm happy to submit a PR with unit tests.  Let me know.






[jira] [Comment Edited] (LUCENE-5317) Concordance/Key Word In Context (KWIC) capability

2017-10-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199160#comment-16199160
 ] 

Tim Allison edited comment on LUCENE-5317 at 10/10/17 6:40 PM:
---

A prototype ASL 2.0 application that demonstrates the utility of the 
concordance is available: https://github.com/mitre/rhapsode


was (Author: talli...@mitre.org):
an prototype ASF 2.0 application that demonstrates the utility of the 
concordance is available: https://github.com/mitre/rhapsode

> Concordance/Key Word In Context (KWIC) capability
> -
>
> Key: LUCENE-5317
> URL: https://issues.apache.org/jira/browse/LUCENE-5317
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: 4.5
>Reporter: Tim Allison
>Assignee: Tommaso Teofili
>  Labels: patch
> Attachments: LUCENE-5317.patch, LUCENE-5317.patch, 
> concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch
>
>
> This patch enables a Lucene-powered concordance search capability.
> Concordances are extremely useful for linguists, lawyers and other analysts 
> performing analytic search vs. traditional snippeting/document retrieval 
> tasks.  By "analytic search," I mean that the user wants to browse every time 
> a term appears (or at least the top n) in a subset of documents and see the 
> words before and after.  
> Concordance technology is far simpler and less interesting than IR relevance 
> models/methods, but it can be extremely useful for some use cases.
> Traditional concordance sort orders are available (sort on words before the 
> target, words after, target then words before and target then words after).
> Under the hood, this is running SpanQuery's getSpans() and reanalyzing to 
> obtain character offsets.  There is plenty of room for optimizations and 
> refactoring.
> Many thanks to my colleague, Jason Robinson, for input on the design of this 
> patch.






[jira] [Commented] (LUCENE-5317) Concordance/Key Word In Context (KWIC) capability

2017-10-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199160#comment-16199160
 ] 

Tim Allison commented on LUCENE-5317:
-

A prototype ASL 2.0 application that demonstrates the utility of the 
concordance is available: https://github.com/mitre/rhapsode

> Concordance/Key Word In Context (KWIC) capability
> -
>
> Key: LUCENE-5317
> URL: https://issues.apache.org/jira/browse/LUCENE-5317
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: 4.5
>Reporter: Tim Allison
>Assignee: Tommaso Teofili
>  Labels: patch
> Attachments: LUCENE-5317.patch, LUCENE-5317.patch, 
> concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch
>
>
> This patch enables a Lucene-powered concordance search capability.
> Concordances are extremely useful for linguists, lawyers and other analysts 
> performing analytic search vs. traditional snippeting/document retrieval 
> tasks.  By "analytic search," I mean that the user wants to browse every time 
> a term appears (or at least the top n) in a subset of documents and see the 
> words before and after.  
> Concordance technology is far simpler and less interesting than IR relevance 
> models/methods, but it can be extremely useful for some use cases.
> Traditional concordance sort orders are available (sort on words before the 
> target, words after, target then words before and target then words after).
> Under the hood, this is running SpanQuery's getSpans() and reanalyzing to 
> obtain character offsets.  There is plenty of room for optimizations and 
> refactoring.
> Many thanks to my colleague, Jason Robinson, for input on the design of this 
> patch.






[jira] [Updated] (SOLR-11462) TokenizerChain's normalize() doesn't work

2017-10-10 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-11462:
---
Affects Version/s: master (8.0)

> TokenizerChain's normalize() doesn't work
> -
>
> Key: SOLR-11462
> URL: https://issues.apache.org/jira/browse/SOLR-11462
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (8.0)
>Reporter: Tim Allison
>Priority: Trivial
>
> TokenizerChain's {{normalize()}} is not currently used so this doesn't 
> currently have any negative effects on search.  However, there is a bug, and 
> we should fix it.
> If applied to a TokenizerChain with {{filters.length > 1}}, only the last 
> would apply. 
>  
> {noformat}
>  @Override
>   protected TokenStream normalize(String fieldName, TokenStream in) {
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
> filter = (TokenFilterFactory) ((MultiTermAwareComponent) 
> filter).getMultiTermComponent();
> result = filter.create(in);
>   }
> }
> return result;
>   }
> {noformat}
> The fix is trivial:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}
> If you'd like to swap out {{TextField#analyzeMultiTerm()}} with, say:
> {noformat}
>   public static BytesRef analyzeMultiTerm(String field, String part, Analyzer 
> analyzerIn) {
> if (part == null || analyzerIn == null) return null;
> return analyzerIn.normalize(field, part);
>   }
> {noformat}
> I'm happy to submit a PR with unit tests.  Let me know.






[jira] [Created] (SOLR-11462) TokenizerChain's normalize() doesn't work

2017-10-10 Thread Tim Allison (JIRA)
Tim Allison created SOLR-11462:
--

 Summary: TokenizerChain's normalize() doesn't work
 Key: SOLR-11462
 URL: https://issues.apache.org/jira/browse/SOLR-11462
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison
Priority: Trivial


TokenizerChain's {{normalize()}} is not currently used so this doesn't 
currently have any negative effects on search.  However, there is a bug, and we 
should fix it.

If applied to a TokenizerChain with {{filters.length > 1}}, only the last would 
apply. 
 
{noformat}
  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    TokenStream result = in;
    for (TokenFilterFactory filter : filters) {
      if (filter instanceof MultiTermAwareComponent) {
        filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
        result = filter.create(in);
      }
    }
    return result;
  }
{noformat}

The fix is trivial:
{noformat}
-result = filter.create(in);
+result = filter.create(result);
{noformat}

If you'd like to swap out {{TextField#analyzeMultiTerm()}} with, say:

{noformat}
  public static BytesRef analyzeMultiTerm(String field, String part, Analyzer analyzerIn) {
    if (part == null || analyzerIn == null) return null;
    return analyzerIn.normalize(field, part);
  }
{noformat}

I'm happy to submit a PR with unit tests.  Let me know.
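To see why the one-line fix matters, here is a self-contained toy model of the chaining bug, with plain string transforms standing in for token filters (these are illustrative stand-ins, not the actual Lucene/Solr classes):

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class ChainBugDemo {
    static final List<UnaryOperator<String>> FILTERS = List.of(
        s -> s.toLowerCase(Locale.ROOT),  // stand-in for a lowercasing filter
        s -> s.replace("\u00E6", "ae"));  // stand-in for an ASCII-folding filter

    // Mirrors the buggy loop: every filter wraps the ORIGINAL input,
    // so only the last filter's output survives.
    static String buggyChain(String in) {
        String result = in;
        for (UnaryOperator<String> f : FILTERS) {
            result = f.apply(in);
        }
        return result;
    }

    // Mirrors the fixed loop: every filter wraps the previous stage's result.
    static String fixedChain(String in) {
        String result = in;
        for (UnaryOperator<String> f : FILTERS) {
            result = f.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(buggyChain("CR\u00C6ZY")); // CRÆZY: folding never sees the lowercased ligature
        System.out.println(fixedChain("CR\u00C6ZY")); // craezy: lowercased, then folded
    }
}
```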





