[
https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079085#comment-13079085
]
Kate McGonigal commented on CONNECTORS-235:
-------------------------------------------
I'm afraid these problems still exist for me.
A few hours ago I built the latest from trunk. It is running on PostgreSQL.
Just in case, I also started from a fresh install of Solr 3.3.0. I'm using the
example that comes with the distribution. It is thus running on Derby. I
realize the schema is not optimal for RSS feeds, but it does include a
"description" field, which is what I'm interested in at the moment.
Problem 1) When I try running the example job with "Dechromed Content" set to
"No dechromed content", what shows up in the description field (for all
documents) is "Jazz radio show from Winnipeg on CKUW 95.9 FM, hosted by Maurice
Hogue." which is not the item-description in the RSS feed's XML, but rather
from the website's metadata description element in the HTML. I have tried
another RSS feed, with the same result.
Problem 2) When I try running the example job (see original post) with
"Dechromed Content" set to "if present, in 'description' field" it still hangs
with the log file showing:
{quote}FATAL 2011-08-03 16:08:21,703 (Worker thread '10') - Error tossed: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
    at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:611)
    at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4263)
    at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1221)
    at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:824)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321){quote}
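For readers unfamiliar with this exception: the trace only says that a value held as a java.lang.String was cast to CharacterInput at run time. The following minimal sketch reproduces that kind of failure in isolation. CharacterInputLike and CastDemo are hypothetical stand-ins invented for this illustration; this is not ManifoldCF code and it does not claim to show where the String comes from.

```java
// Hypothetical stand-in for illustration only; NOT ManifoldCF's CharacterInput.
interface CharacterInputLike { }

class CastDemo {
    /** Returns true if casting the value to CharacterInputLike fails at run time. */
    static boolean castFails(Object value) {
        try {
            // Compiles, because any Object might implement the interface,
            // but throws if the actual run-time class does not.
            CharacterInputLike ignored = (CharacterInputLike) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A String does not implement the interface, so the cast fails.
        System.out.println(castFails("an item description held as a String"));
        // An object that does implement it casts cleanly.
        System.out.println(castFails(new CharacterInputLike() { }));
    }
}
```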
And just to be clear on what I am ultimately trying to do: I'd like to be able
to show my searchers the "description" from the RSS feed for each of the
documents that match their searches. I actually only need to index the
item-description field (as opposed to what is at the item link) since my RSS
feeds are of scientific papers that will have a detailed abstract in the
item-description.
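To make concrete which element I mean, here is a minimal sketch that pulls out only the rss/channel/item/description values and ignores the channel-level description. It uses the JDK's built-in XML and XPath classes. The class name ItemDescriptions and the sample feed text are made up for this illustration, and this is not how the RSS connector itself is implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of ManifoldCF.
class ItemDescriptions {
    /** Returns the trimmed text of every rss/channel/item/description element, in feed order. */
    static List<String> extract(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            // Selects item-level descriptions only; the channel description is not matched.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/rss/channel/item/description", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed with the same structure as in the issue description.
        String feed = "<rss><channel>"
                + "<title>Sample</title>"
                + "<description>channel-level description</description>"
                + "<item><title>Paper one</title>"
                + "<description>abstract of paper one</description></item>"
                + "<item><title>Paper two</title>"
                + "<description>abstract of paper two</description></item>"
                + "</channel></rss>";
        System.out.println(extract(feed));
    }
}
```

Each returned string is what I would like to see in the Solr "description" field for the corresponding item link.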
> item description element not indexed
> ------------------------------------
>
> Key: CONNECTORS-235
> URL: https://issues.apache.org/jira/browse/CONNECTORS-235
> Project: ManifoldCF
> Issue Type: Improvement
> Components: RSS connector
> Affects Versions: ManifoldCF 0.2
> Reporter: Kate McGonigal
> Assignee: Karl Wright
> Fix For: ManifoldCF 0.3
>
>
> The RSS feed's *item* description is not written to any field in the Solr
> index.
> I have a typical RSS feed with the general structure:
> <rss>
>   <channel>
>     <title></title>
>     <link></link>
>     <description></description>
>     <item>
>       <title></title>
>       <link></link>
>       <pubDate></pubDate>
>       <description> *** the description I do want *** </description>
>       <author></author>
>       <category></category>
>     </item>
>   </channel>
> </rss>
> Example:
> For the RSS feed:
> http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
> the rss/channel/item/description field is not indexed into Solr.
> Example notes:
> - what does get written to the Solr "description" field is the description
> metadata from the website, i.e. "Jazz radio show from Winnipeg on CKUW 95.9
> FM, hosted by Maurice Hogue." in this case.
> - on the "Dechromed Content" tab of the job, "No dechromed content" is
> selected. I'm not sure if that is relevant.