[ https://issues.apache.org/jira/browse/SOLR-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl updated SOLR-2826:
------------------------------
Attachment: SOLR-2826.patch
Here's the code. This code has been running in production for months.
Sample config:
{code}
<processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">myUrl</str>
  <str name="domainOutputField">host</str>
  <str name="canonicalUrlOutputField">canonicalurl</str>
</processor>
{code}
This reads the URL from the field "myUrl", analyzes it, and writes the host
name to "host", a canonical (normalized) version of the URL to "canonicalurl",
the URL length to "url_length", and the number of levels in the URL to
"url_levels". If the URL is a top-level URL, it writes "1" to the field
"url_toplevel"; if it looks like a landing page, e.g. index.html, it writes
"1" to the field "url_landingpage"...
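For readers without the patch at hand, here is a rough Java sketch of the kind
of analysis described above. It is not the code from SOLR-2826.patch: the class
name UrlClassifySketch, the method classify, and the exact rules for counting
levels, detecting a top-level URL, spotting a landing page and building the
canonical form are all assumptions made for illustration. Only the field names
are taken from the description above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; NOT the implementation attached to SOLR-2826. */
final class UrlClassifySketch {

    /** Derives the output fields for one absolute URL. All rules below are assumed. */
    static Map<String, Object> classify(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL with a host: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getPath().isEmpty() ? "/" : uri.getPath();

        // Assumed rule: a "level" is one non-empty path segment.
        int levels = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                levels++;
            }
        }

        // Assumed rule: top level means no path segments and no query string.
        boolean topLevel = levels == 0 && uri.getQuery() == null;

        // Assumed rule: landing page means the path ends in "/" or in an
        // index/default file name, and there is no query string.
        String lastSegment = path.substring(path.lastIndexOf('/') + 1).toLowerCase(Locale.ROOT);
        boolean landingPage = uri.getQuery() == null
                && (lastSegment.isEmpty() || lastSegment.matches("(index|default)\\.[a-z0-9]+"));

        // Assumed canonical form: lower-cased scheme and host, no user info, no fragment.
        String canonical;
        try {
            canonical = new URI(scheme, null, host, uri.getPort(), path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot canonicalize: " + url, e);
        }

        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("host", host);
        fields.put("canonicalurl", canonical);
        fields.put("url_length", url.length()); // assumed: length of the input string
        fields.put("url_levels", levels);
        if (topLevel) {
            fields.put("url_toplevel", "1");
        }
        if (landingPage) {
            fields.put("url_landingpage", "1");
        }
        return fields;
    }
}
```

Under these assumed rules, classify("http://example.com/") yields host
"example.com" and url_levels 0, and sets both url_toplevel and url_landingpage
to "1". The real processor may well draw those lines differently.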
> URLClassify Update Processor
> ----------------------------
>
> Key: SOLR-2826
> URL: https://issues.apache.org/jira/browse/SOLR-2826
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Labels: UpdateProcessor
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-2826.patch
>
>
> Processor which analyzes a URL and outputs to other fields: length, #levels,
> isTopLevel true/false, host part, path part, canonicalized URL etc.
> Kindly donated by Oslo University