epugh commented on PR #3674:
URL: https://github.com/apache/solr/pull/3674#issuecomment-3481051804
> > The PR introduces breaking changes (therefore backporting should
probably be avoided). Apache Tika 2 and 3 standardized the metadata fields,
which affect the returned fields.
>
> I tackled that in the `tikaserver` backend by adding a Metadata mapper
that, if enabled, will map from e.g. `dc.author` to `Author` to please what
users might have come to expect in Tika1.x. If you intend to pursue some
upgrade in the 9.x line, re-using that class could perhaps make the upgrade
somewhat more compatible. But if it is compatible enough to warrant this
breaking change in 9.x I don't know.
>
> I'd not be opposed to announcing that a "necessary" breaking change will
happen in, say 9.11, due to security risks, and then prepare users for the
change. I kept the mapping option hidden, un-documented, since I don't want us
to have to support it. But one could offer a user-supplied map `{"from": "to",
"from2": "to2"}` where she could tailor this. Or, perhaps that would not be
needed since we already have the fmap feature able to map fields, e.g.
`fmap.dc.author=Author`.
I think this is reasonable. Upgrading 9.x to Tika 2 or 3 is a huge
effort, and I don't think the payoff is there. The new pluggable backends
give us a better path forward.
Anyone using Tika needs to anticipate upgrading their codebase anyway for
Solr 10.
I think documenting either or both of the alternative approaches is
fine. I suspect the vast majority of Tika users will either NOT upgrade
or jump directly to Solr 10, which is IMO what they should do! The move
from Tika 1 to Tika 3 alone means users will want to revalidate
everything, so they won't be able to move easily between Solr 9 versions
anyway: we all know that Tika 3 is going to handle documents slightly
differently than Tika 1 did, and users will need to
test/validate/understand that.
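For anyone documenting the user-supplied map idea quoted above, here is a
minimal sketch of what such a rename does. It is NOT the mapper class from
the PR: `remap_metadata` and `LEGACY_NAMES` are hypothetical names, and the
only mapping taken from this thread is `dc.author` to `Author`.

```python
def remap_metadata(metadata, mapping):
    """Return a copy of `metadata` with keys renamed according to `mapping`.

    Keys that have no entry in `mapping` pass through unchanged. If two
    source keys map to the same target, the one seen last wins.
    """
    return {mapping.get(key, key): value for key, value in metadata.items()}


# Hypothetical user-supplied map in the {"from": "to"} shape discussed above.
LEGACY_NAMES = {"dc.author": "Author"}

# Made-up extraction result, used only to exercise the function.
extracted = {"dc.author": "Jane Doe", "Content-Type": "application/pdf"}
legacy = remap_metadata(extracted, LEGACY_NAMES)
```

The existing `fmap.dc.author=Author` request parameter expresses the same
rename per request, so a separate map may not be needed.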
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]