[ 
https://issues.apache.org/jira/browse/SOLR-15381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337404#comment-17337404
 ] 

Jan Høydahl commented on SOLR-15381:
------------------------------------

As mentioned elsewhere, this tool is not intended for any production or serious 
use, just for playing around with Solr and throwing some test data into it. So 
it is a non-goal to extend this to handle all kinds of invalid HTML or all 
kinds of web sites. I'm tempted to close this as Won't Fix. But if you 
have a very simple pull request that fixes it, we can still consider it.

> SimplePostTool.java PageFetcher error
> -------------------------------------
>
>                 Key: SOLR-15381
>                 URL: https://issues.apache.org/jira/browse/SOLR-15381
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SimplePostTool
>            Reporter: QualiteSys QualiteSys
>            Priority: Major
>
> The SimplePostTool fails to grab web pages in simple cases.
> The getLinksFromWebPage process fails to detect URLs within the HTML page at 
> line 1252. This seems to be a problem when the HTML page is not well-formed 
> from an XML point of view.
>  
> Example to reproduce the problem:
> java -Dc=techproducts -Ddata=web -Drecursive=3 -jar 
> example\exampledocs\post.jar http://www.google.com/
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
