[
https://issues.apache.org/jira/browse/SOLR-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-2802:
---------------------------
Attachment: SOLR-2802_update_processor_toolkit.patch
I had some time to revisit this issue again today.
Improvements in this patch:
* exclude options - you can now specify one or more sets of "exclude" lists
which are parsed just like the main list of field specifiers (examples below)
* improved defaults for ConcatFieldUpdateProcessorFactory - default behavior is
now to only concat values for fields that the schema says are multiValued=false
and (StrField or TextField)
* new RemoveBlankFieldUpdateProcessorFactory - removes any zero-length
CharSequence values it finds; by default it looks at all fields
* new FieldLengthUpdateProcessorFactory - replaces any CharSequence values it
finds with their length; by default it looks at no fields
As part of this work, I tweaked the abstract classes so that the assumption
about which fields a subclass should match by default is still "all fields",
but it's easy for subclasses to override this -- the user still has the final
say, and the abstract class handles that, but if the user doesn't configure
anything the subclass can easily say "my default should be ___"
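To make the "subclass default vs. user configuration" split concrete, here is a minimal sketch. It is not the code from the patch: the class names (FieldMutatingSketch, FieldLengthSketch, TrimSketch) and the method names are hypothetical, and a plain Predicate<String> stands in for whatever field selector the real abstract classes use. Exclusion rules are left out to keep it short.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for the abstract base class; not the patch's actual API.
abstract class FieldMutatingSketch {
    // Selector built from the user's configuration, or null if nothing was configured.
    private final Predicate<String> userSelector;

    FieldMutatingSketch(Predicate<String> userSelectorOrNull) {
        this.userSelector = userSelectorOrNull;
    }

    // Base assumption: match all fields. Subclasses may override this.
    protected Predicate<String> defaultSelector() {
        return fieldName -> true;
    }

    // The user has the final say; the subclass default applies only when
    // the user configured nothing.
    final boolean shouldMutate(String fieldName) {
        Predicate<String> selector =
            (userSelector != null) ? userSelector : defaultSelector();
        return selector.test(fieldName);
    }
}

// Keeps the inherited "all fields" default.
final class TrimSketch extends FieldMutatingSketch {
    TrimSketch(Predicate<String> userSelectorOrNull) {
        super(userSelectorOrNull);
    }
}

// Says "my default should be: no fields".
final class FieldLengthSketch extends FieldMutatingSketch {
    FieldLengthSketch(Predicate<String> userSelectorOrNull) {
        super(userSelectorOrNull);
    }

    @Override
    protected Predicate<String> defaultSelector() {
        return fieldName -> false;
    }
}
```

With no user configuration, new TrimSketch(null) matches every field name and new FieldLengthSketch(null) matches none; passing any selector to either constructor overrides the subclass default.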
bq. I think I don't completely follow the explicit ruling
I explained myself really terribly before - I was conflating what should
really be two orthogonal things:
1) the *field names* that a processor looks at -- the user should have lots of
options for configuring the field selector explicitly, and if they don't, then
a sensible default based on the specifics of the processor should be applied,
and the user should still have the ability to configure exclusion rules on top
of that default
2) the *value types* that a processor will deal with -- regardless of what field
names a processor is configured with, it should be logical about the types of
values it finds in those fields. The FieldLengthUpdateProcessorFactory I just
added, for example, only pays attention to values that are CharSequence; if the
SolrInputField already contained an Integer, it wouldn't make sense to
toString() that and then find the length of that String value.
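A rough illustration of point 2, assuming a field's values are available as a plain java.util.List (the real processor works on SolrInputField, which this sketch does not use). The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; illustrates the value-type rule only.
final class FieldLengthValuesSketch {
    // Replaces each CharSequence value with its length.
    // Any other value (Integer, Date, ...) is passed through untouched
    // rather than being toString()'d and measured.
    static List<Object> mutate(List<?> values) {
        List<Object> out = new ArrayList<>(values.size());
        for (Object value : values) {
            if (value instanceof CharSequence) {
                out.add(((CharSequence) value).length());
            } else {
                out.add(value);
            }
        }
        return out;
    }
}
```

Given the values "abc", the Integer 42 and a StringBuilder holding "hello", this yields 3, 42 and 5: the Integer is left alone even though the field name was selected.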
bq. I think Date/Number parsing should only be done on compatible fields only.
I think if a subsequent parser moves / renames fields, then this processor
should have been configured before the processor that does the Date/Number
parsing.
But that could easily lead to a chicken-vs-egg problem. I think ideally you
should be able to have field names in your SolrInputDocuments (and in your
processor configurations) that don't exist in your schema at all, so you can
have "transitory" names that exist purely for passing info around.
Imagine a situation where you want to let clients submit documents containing a
"publishDate" field, but you want to be able to cleanly accept real Date
objects (from java clients) or Strings in a variety of formats, and then you
want the final index to contain two versions of that date: one indexed
TrieDateField called "pubDate", and one non-indexed StrField called
"prettyDate" -- ie, there is no "publishDate" in your schema at all. You
could then configure some "ParseDateFieldUpdateProcessor" on the "publishDate"
even though that field name isn't in your schema, so that you have consistent
Date objects, and then use a CloneFieldUpdateProcessor and/or
RenameFieldUpdateProcessor to get that Date object into both your "pubDate" and
"prettyDate" fields, and then use some sort of FormatDateFieldUpdateProcessor
on the "prettyDate" field.
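The data flow in that scenario can be sketched without any Solr classes. Everything below is an assumption made for illustration: the document is modelled as a plain Map, the three steps are inlined in one method rather than being separate processors, only ISO-8601 instant strings are parsed (not "a variety of formats"), and the "pretty" pattern is arbitrary.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the parse -> clone/rename -> format chain.
final class PublishDateChainSketch {
    private static final DateTimeFormatter PRETTY =
        DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH)
                         .withZone(ZoneOffset.UTC);

    static Map<String, Object> process(Map<String, Object> input) {
        Map<String, Object> doc = new HashMap<>(input);

        // Step 1 ("parse"): normalize the transitory field to a Date.
        // Real Date objects pass through; only strings are parsed.
        Object raw = doc.remove("publishDate");
        Date date;
        if (raw instanceof Date) {
            date = (Date) raw;
        } else if (raw instanceof CharSequence) {
            date = Date.from(Instant.parse((CharSequence) raw));
        } else {
            // Missing or of an unrelated type: put it back and do nothing.
            if (raw != null) {
                doc.put("publishDate", raw);
            }
            return doc;
        }

        // Step 2 ("clone"/"rename"): the Date lands in both schema fields.
        doc.put("pubDate", date);
        doc.put("prettyDate", date);

        // Step 3 ("format"): only prettyDate becomes a display string.
        doc.put("prettyDate", PRETTY.format(date.toInstant()));
        return doc;
    }
}
```

The point of the sketch is that "publishDate" never has to exist in the schema: it is consumed in step 1, and only "pubDate" and "prettyDate" survive to be indexed.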
There may be other solutions to that type of problem, but I guess the bottom
line from my perspective is: why bother making a processor deliberately fail
when the user configures it to do something unexpected but still viable? If they
want to parse Strings -> Dates on a TrieIntField, why not just let them do it?
Maybe they've got another processor later that is going to convert that Date to
"days since epoch" as an integer?
{panel}
Examples of the exclude configuration...
{code}
<updateRequestProcessorChain name="trim-few">
<processor class="solr.TrimFieldUpdateProcessorFactory">
<str name="fieldRegex">foo.*</str>
<str name="fieldRegex">bar.*</str>
<!-- each set of exclusions is checked independently -->
<lst name="exclude">
<str name="typeClass">solr.DateField</str>
</lst>
<lst name="exclude">
<str name="fieldRegex">.*HOSS.*</str>
</lst>
</processor>
</updateRequestProcessorChain>
<updateRequestProcessorChain name="trim-some">
<processor class="solr.TrimFieldUpdateProcessorFactory">
<str name="fieldRegex">foo.*</str>
<str name="fieldRegex">bar.*</str>
<!-- only excluded if it matches all in set -->
<lst name="exclude">
<str name="typeClass">solr.DateField</str>
<str name="fieldRegex">.*HOSS.*</str>
</lst>
</processor>
</updateRequestProcessorChain>
{code}
In the "trim-few" case, field names will be excluded if they are DateFields
_or_ match the "HOSS" regex. In the "trim-some" case, field names will be
excluded only if they are _both_ a DateField _and_ match the "HOSS" regex.
{panel}
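The same any-set / all-criteria rule, written out as a small sketch. This is not the patch's implementation: the ExcludeSketch class and its helpers are hypothetical, a field's type is reduced to a class-name string, and fieldRegex is assumed to match against the whole field name.

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.regex.Pattern;

// Hypothetical model of the "exclude" semantics described above.
final class ExcludeSketch {
    // One criterion inside an exclude set; tests (fieldName, typeClassName).
    interface Criterion extends BiPredicate<String, String> {}

    // Assumption: the regex must match the entire field name.
    static Criterion fieldRegex(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return (fieldName, typeClassName) -> pattern.matcher(fieldName).matches();
    }

    static Criterion typeClass(String className) {
        return (fieldName, typeClassName) -> className.equals(typeClassName);
    }

    // Each exclude set is checked independently: a field is excluded if ANY
    // set matches, and a set matches only if ALL of its criteria match.
    static boolean isExcluded(List<List<Criterion>> excludeSets,
                              String fieldName, String typeClassName) {
        return excludeSets.stream().anyMatch(
            set -> set.stream().allMatch(c -> c.test(fieldName, typeClassName)));
    }
}
```

Modelling "trim-few" as two one-element sets and "trim-some" as one two-element set, a StrField named "foo_HOSS" is excluded by the former but not by the latter, while a DateField with the same name is excluded by both.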
> Toolkit of UpdateProcessors for modifying document values
> ---------------------------------------------------------
>
> Key: SOLR-2802
> URL: https://issues.apache.org/jira/browse/SOLR-2802
> Project: Solr
> Issue Type: New Feature
> Reporter: Hoss Man
> Attachments: SOLR-2802_update_processor_toolkit.patch,
> SOLR-2802_update_processor_toolkit.patch
>
>
> Frequently users ask questions about things where the answer is "you
> could do it with an UpdateProcessor" but the number of out of the box
> UpdateProcessors is generally lacking and there aren't even very good base
> classes for the common case of manipulating field values when adding documents