[ 
https://issues.apache.org/jira/browse/CONNECTORS-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680908#comment-13680908
 ] 

Koji Sekiguchi commented on CONNECTORS-710:
-------------------------------------------

Hi Karl,

I talked with Osuka-san and I think his proposal is very natural.

The other day he implemented the FileSystem output connector for MCF 
(CONNECTORS-696), and this ticket follows from that work. He wants MCF to be 
able to reproduce the original URL. Please consider the following use case:

# MCF crawls a web site (e.g. http://manifoldcf.apache.org/ ) and writes the 
pages to the local file system using CONNECTORS-696. We now have 
/Users/minoru/tmp/http/manifoldcf.apache.org/index.html etc. on our local 
server.
# We can now run an arbitrary program over the downloaded pages. For example, 
the program might remove particular blocks that are useless for search (ads, 
side menus, etc.) from those HTML pages. To do this, the program reads the 
pages under the /Users/minoru/tmp/http directory recursively, processes them, 
and writes the results to the corresponding paths under the 
/Users/minoru/*out*/http directory.
# Finally, we use the MCF FileConnector to crawl the /Users/minoru/*out*/http 
directory. We need the option he proposes in order to reproduce the URL of the 
original data source: when we search the HTML pages in Solr and click a link, 
we want to go to the original web site, not to the server running MCF.
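
As a rough illustration of step 2 only, a program of that kind might look like the Python sketch below. The names mirror_processed_tree and drop_marked_blocks, and the <!-- ad --> markers, are hypothetical and not part of MCF; real block removal would depend on the site being crawled.

```python
import re
from pathlib import Path
from typing import Callable, List


def drop_marked_blocks(html: str) -> str:
    """Hypothetical cleanup: delete everything between <!-- ad --> and <!-- /ad -->."""
    return re.sub(r"<!-- ad -->.*?<!-- /ad -->", "", html, flags=re.DOTALL)


def mirror_processed_tree(src_root: Path, dst_root: Path,
                          transform: Callable[[str], str]) -> List[Path]:
    """Write every file under src_root to the same relative path under dst_root,
    passing its text through transform on the way. Returns the written paths."""
    written = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        # Keep the relative layout so the scheme/host/path structure survives.
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        text = src.read_text(encoding="utf-8", errors="replace")
        dst.write_text(transform(text), encoding="utf-8")
        written.append(dst)
    return written
```

For the use case above, it would be called as mirror_processed_tree(Path("/Users/minoru/tmp/http"), Path("/Users/minoru/out/http"), drop_marked_blocks). Because the relative layout is preserved, the scheme and host directories are still present in the output tree for step 3.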
                
> FileConnector should have option of outputting a full http url, not just a 
> file:/ url
> -------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-710
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-710
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: File system connector
>    Affects Versions: ManifoldCF 1.3
>            Reporter: Minoru Osuka
>             Fix For: ManifoldCF 1.3
>
>         Attachments: Screen Shot 2013-06-11 at 5.46.55 PM.png
>
>
> I would like to enhance FileConnector so that it can convert a file path 
> to a URI.
> FileOutputConnector writes files to a path laid out like Wget's output:
> $OUTPUT_PATH/http/localhost:8345/mcf-crawler-ui/showjobstatus.jsp
> I would like to enhance FileConnector so that it can emit a 
> documentIdentifier like WebConnector's.
> The current FileConnector outputs an id like the following:
> {code:xml}
> <str name="id">file:/Users/minoru/tmp/out/http/localhost:8345/mcf-crawler-ui/showjobstatus.jsp</str>
> {code}
> The enhanced FileConnector could output an id like the following,
> {code:xml}
> <str name="id">file:/Users/minoru/tmp/out/http/localhost:8345/mcf-crawler-ui/showjobstatus.jsp</str>
> {code}
> or
> {code:xml}
> <str name="id">http://localhost:8345/mcf-crawler-ui/showjobstatus.jsp</str>
> {code}
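
A minimal sketch of the path-to-URL mapping requested above, assuming the output tree is laid out as $OUTPUT_PATH/<scheme>/<host[:port]>/<path>. It is an illustration in Python, not ManifoldCF's code: the function name path_to_original_url and its root parameter are hypothetical, and percent-encoding and query strings are not handled.

```python
from pathlib import PurePosixPath
from typing import Optional


def path_to_original_url(file_path: str, root: str) -> Optional[str]:
    """Map <root>/<scheme>/<host[:port]>/<rest> to <scheme>://<host[:port]>/<rest>.

    Returns None when file_path is not under root, has no host segment,
    or starts with a scheme directory other than http or https.
    """
    try:
        relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    except ValueError:
        # file_path is outside root, so there is nothing to reconstruct.
        return None
    parts = relative.parts
    if len(parts) < 2 or parts[0] not in ("http", "https"):
        return None
    scheme, host, rest = parts[0], parts[1], parts[2:]
    return f"{scheme}://{host}/" + "/".join(rest)


if __name__ == "__main__":
    print(path_to_original_url(
        "/Users/minoru/tmp/out/http/localhost:8345/mcf-crawler-ui/showjobstatus.jsp",
        "/Users/minoru/tmp/out",
    ))
    # prints http://localhost:8345/mcf-crawler-ui/showjobstatus.jsp
```

Returning None for paths that do not match the layout lets a caller fall back to the existing file:/ id, which matches the idea of making this an option instead of a replacement.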

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
