[ https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191962#comment-14191962 ]
yuanyun.cn commented on SOLR-3975:
----------------------------------
It would be good if Solr could provide a document summarization update
processor, so that during indexing we can generate a summary of each document
and store it in the index.
> Document Summarization toolkit, using LSA techniques
> ----------------------------------------------------
>
> Key: SOLR-3975
> URL: https://issues.apache.org/jira/browse/SOLR-3975
> Project: Solr
> Issue Type: New Feature
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: 4.1.summary.patch, reuters.sh
>
>
> This package analyzes sentences and words as used across sentences to rank
> the most important sentences and words. The general topic is called "document
> summarization" and is a popular research topic in textual analysis.
> How to use:
> 1) Check out the 4.x branch, apply the patch, build, and run the solr/example
> instance.
> 2) Download the first Reuters article corpus from:
> http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
> 3) Unpack this into a directory.
> 4) Run the attached 'reuters.sh' script:
> sh reuters.sh directory http://localhost:8983/solr/collection1
> 5) Wait several minutes.
> Now go to http://localhost:8983/solr/collection1/browse?summary=true and look
> at the large gray box marked 'Document Summary'. This has a table of
> statistics about the analysis, the three most important sentences, and
> several of the most important words in the documents. The sentences have the
> important words in italics.
> The code is packaged as a search component and as an analysis handler. The
> /browse demo uses the search component, and you can also post raw text to
> http://localhost:8983/solr/collection1/analysis/summary. Here is a sample
> command:
> {code}
> curl -s \
>   "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
>   --data-binary "@$FILE" -H 'Content-type:application/xml'
> {code}
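For anyone who would rather script this than use curl, here is a rough Python equivalent of the command above. It is only a sketch: the function name `build_summary_request` and the `file_label` argument are made up for illustration, the base URL is the one from the curl example, and it assumes the handler accepts the raw text as the POST body exactly as the curl command sends it.

```python
import urllib.parse
import urllib.request


def build_summary_request(text, base_url="http://localhost:8983/solr",
                          file_label="sample.txt"):
    """Build (but do not send) a POST request mirroring the curl example.

    build_summary_request and file_label are hypothetical names; only the
    URL path, query parameters, and header come from the curl command.
    """
    # Same query parameters as the curl command, in the same order.
    params = urllib.parse.urlencode({
        "indent": "true",
        "echoParams": "all",
        "file": file_label,
        "wt": "xml",
    })
    url = f"{base_url}/analysis/summary?{params}"
    # The raw text goes in the request body, like curl's --data-binary.
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-type": "application/xml"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` should send it to a running instance with the patch applied; I have not checked what the response looks like beyond what the description says.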
> This is an implementation of LSA-based document summarization. A short
> explanation and a long evaluation are described in my blog, [Uncle Lance's
> Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here:
> [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
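To give a feel for what "LSA-based" sentence ranking means, here is a toy sketch in plain Python. It is not the code in the attached patch and makes no claim to match it: it assumes a raw term-count matrix with one column per sentence, finds the dominant right singular vector by power iteration on the sentence-by-sentence Gram matrix, and ranks sentences by their weight in that vector. The names `tokenize` and `rank_sentences` are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def rank_sentences(sentences, iterations=100):
    """Return sentence indices ordered from most to least important.

    Toy LSA-style ranking: build term-count columns (one per sentence),
    then score each sentence by the magnitude of its component in the
    dominant right singular vector of the term-by-sentence matrix.
    """
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    index = {w: i for i, w in enumerate(vocab)}

    # columns[j][i] holds the count of term i in sentence j.
    columns = []
    for s in sentences:
        col = [0.0] * len(vocab)
        for word, count in Counter(tokenize(s)).items():
            col[index[word]] = float(count)
        columns.append(col)

    n = len(sentences)
    # Gram matrix G = A^T A; its dominant eigenvector is the dominant
    # right singular vector of A.
    gram = [[sum(a * b for a, b in zip(columns[i], columns[j]))
             for j in range(n)] for i in range(n)]

    # Power iteration from a positive start vector.
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    return sorted(range(n), key=lambda i: -abs(v[i]))
```

With raw counts this favors long sentences that share vocabulary with the rest of the text; a real implementation would normally weight the terms and look at more than one singular vector.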