[
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482114#comment-17482114
]
Karl Wright commented on CONNECTORS-1695:
-----------------------------------------
The interestingMimeTypes are MIME types that may be HTML or XHTML, or that
belong to documents where links cannot be extracted but the content can still
be indexed. What happens to a given document depends on its actual content
rather than on what the MIME type says; the MIME types are only a crude filter.
The code that we'd want to modify would be the extractLinks code:
{code}
// Now, extract links.
// We'll call the "link extractor" series, so we can plug more stuff in over time.
boolean indexDocument = extractLinks(documentIdentifier,activities,filter);
{code}
The code for this method is:
{code}
/** Code to extract links from an already-fetched document. */
protected boolean extractLinks(String documentIdentifier, IProcessActivity activities, DocumentURLFilter filter)
  throws ManifoldCFException, ServiceInterruption
{
  ProcessActivityRedirectionHandler redirectHandler = new ProcessActivityRedirectionHandler(documentIdentifier,activities,filter);
  handleRedirects(documentIdentifier,redirectHandler);
  if (Logging.connectors.isDebugEnabled() && redirectHandler.shouldIndex() == false)
    Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of redirection");
  // For html, we don't want any actions, because we don't do form submission.
  ProcessActivityHTMLHandler htmlHandler = new ProcessActivityHTMLHandler(documentIdentifier,activities,filter,metaRobotsTagsUsage);
  handleHTML(documentIdentifier,htmlHandler);
  if (Logging.connectors.isDebugEnabled() && htmlHandler.shouldIndex() == false)
    Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of HTML robots or content tags prohibiting indexing");
  ProcessActivityXMLHandler xmlHandler = new ProcessActivityXMLHandler(documentIdentifier,activities,filter);
  handleXML(documentIdentifier,xmlHandler);
  if (Logging.connectors.isDebugEnabled() && xmlHandler.shouldIndex() == false)
    Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of XML robots or content tags prohibiting indexing");
  // May add more later for other extraction tasks.
  return htmlHandler.shouldIndex() && redirectHandler.shouldIndex() && xmlHandler.shouldIndex();
}
{code}
Note that three different parsing attempts are made: HTML, XML (which I
believe is RSS feeds only at this point), and redirection pages. You could add
a fourth.
Most of these invoke the fuzzyml parser, which is a bottom-up parser with
overrides for specific interesting tags. Even though sitemap xml is supposedly
well formed, you wouldn't want to bet on it, so the fuzzyml parser would be a
reasonable choice for this kind of parsing: it is quite resilient against
syntax errors of all kinds.
So the trick would be to identify the tag structure of a sitemap document and
extend the overrides present for the parser the web connector is using to
understand that xml syntax IN ADDITION TO the xhtml it already understands.
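To make the tag structure concrete, here is a minimal standalone sketch of the
extraction a sitemap-aware handler would need to do: collect the text of every
loc tag that sits inside a sitemap entry (in a sitemapindex) or a url entry (in
a urlset). This is an illustration only. It uses the JDK's StAX parser rather
than fuzzyml, so unlike fuzzyml it rejects malformed input outright, and it
does not touch any ManifoldCF class. The class name SitemapLocSketch, the
method extractLocs, and the example.org URL are all made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of ManifoldCF.
final class SitemapLocSketch {

  /** Returns the text of every loc element nested in a sitemap or url entry. */
  static List<String> extractLocs(String xml) {
    List<String> locs = new ArrayList<>();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // Sitemaps need neither DTDs nor external entities, so switch both off.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
    try {
      XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
      boolean inEntry = false;
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          // getLocalName() ignores the namespace prefix, so this works with
          // or without the sitemaps.org xmlns declaration.
          String name = reader.getLocalName();
          if (isEntry(name)) {
            inEntry = true;
          } else if (inEntry && name.equals("loc")) {
            locs.add(reader.getElementText().trim());
          }
        } else if (event == XMLStreamConstants.END_ELEMENT && isEntry(reader.getLocalName())) {
          inEntry = false;
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      // StAX is strict; a fuzzyml-based handler would keep going instead.
      throw new IllegalArgumentException("Not well-formed XML", e);
    }
    return locs;
  }

  // sitemap entries live in a sitemapindex, url entries in a urlset.
  private static boolean isEntry(String name) {
    return name.equals("sitemap") || name.equals("url");
  }

  public static void main(String[] args) {
    String index =
        "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
      + "<sitemap><loc>https://example.org/sitemap_1.xml</loc>"
      + "<lastmod>2022-01-21T16:04:45Z</lastmod></sitemap>"
      + "</sitemapindex>";
    System.out.println(extractLocs(index));
  }
}
```

In the connector itself, each extracted URL would presumably be handed to the
same link-discovery path the existing handlers use, and the handler would
report that the sitemap document itself should not be indexed. That wiring is
not shown here because it depends on the connector internals.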
> Sitemap xml not detected in version 2.17 webconnector
> -----------------------------------------------------
>
> Key: CONNECTORS-1695
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.17
> Reporter: DK
> Priority: Major
>
> Trying to index sitemap xml and web connector index the whole xml into solr.
> Please fix in version 2.17.
> If it is any special config that needs to be taken care, please add here and
> add in documentation to make it clear.
>
> Sitemap.xml:
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https://<url>/sitemap_1.xml</loc>
> <lastmod>2022-01-21T16:04:45Z</lastmod>
> </sitemap>
> </sitemapindex>
>
> sitemap_1.xml:
> <urlset>
> <url>
> <loc>https://<docurl></loc>
> <lastmod>2018-10-31T11:25:27Z</lastmod>
> </url>
> </urlset>