[ 
https://issues.apache.org/jira/browse/SOLR-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-2826:
------------------------------

    Attachment: SOLR-2826.patch

Here's the code; it has been running in production for months.

Sample config:

{code}
<processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">myUrl</str>
  <str name="domainOutputField">host</str>
  <str name="canonicalUrlOutputField">canonicalurl</str>
</processor>
{code}

This reads the URL from the field "myUrl", analyzes it, and writes the
results to other fields: the host name to "host", a canonical (normalized)
version of the URL to "canonicalurl", the URL length to "url_length", and
the number of levels in the URL to "url_levels". If the URL is a top-level
URL, "1" is written to "url_toplevel"; if it looks like a landing page,
e.g. index.html, "1" is written to "url_landingpage"...
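
To illustrate the kind of analysis described above, here is a minimal Java sketch. It is not the code from the attached patch: the class name UrlClassifySketch, the list of landing-page file names, and the exact rules for "levels", "toplevel" and canonicalization are assumptions made for this example. Only the field names are taken from this message.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch; the real processor lives in the patch.
class UrlClassifySketch {

    // Assumed list of landing-page file names (not taken from the patch).
    private static final String[] LANDING_PAGES = {"index.html", "index.htm", "default.html"};

    static Map<String, Object> classify(String url) {
        URI uri = URI.create(url).normalize();
        String scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String rawPath = uri.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;

        // Assumption: "levels" is the number of non-empty path segments.
        int levels = 0;
        String lastSegment = "";
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                levels++;
                lastSegment = segment;
            }
        }

        // Assumption: a landing page is a URL whose last segment is a known index file.
        boolean landingPage = false;
        for (String name : LANDING_PAGES) {
            if (name.equalsIgnoreCase(lastSegment)) {
                landingPage = true;
            }
        }

        // Assumption: canonical form = lowercased scheme and host, explicit port
        // if present, path defaulting to "/", query kept, fragment dropped.
        StringBuilder canonical = new StringBuilder(scheme).append("://").append(host);
        if (uri.getPort() != -1) {
            canonical.append(':').append(uri.getPort());
        }
        canonical.append(path);
        if (uri.getRawQuery() != null) {
            canonical.append('?').append(uri.getRawQuery());
        }

        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("host", host);
        fields.put("canonicalurl", canonical.toString());
        fields.put("url_length", url.length());
        fields.put("url_levels", levels);
        // Assumption: top-level means no path segments and no query string.
        if (levels == 0 && uri.getRawQuery() == null) {
            fields.put("url_toplevel", "1");
        }
        if (landingPage) {
            fields.put("url_landingpage", "1");
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(classify("http://WWW.Example.com/a/b/index.html#top"));
    }
}
```

In this sketch the flag fields are only set when the condition holds, mirroring the 'write "1" to field' wording above. Whether the actual processor omits the field or writes a different value otherwise is not stated here.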
                
> URLClassify Update Processor
> ----------------------------
>
>                 Key: SOLR-2826
>                 URL: https://issues.apache.org/jira/browse/SOLR-2826
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>              Labels: UpdateProcessor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2826.patch
>
>
> Processor which analyzes a URL and outputs to other fields: length, #levels, 
> isTopLevel true/false, host part, path part, canonicalized URL etc.
> Kindly donated by Oslo University



