[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783547#comment-16783547 ]

Nazerke Seidan commented on SOLR-7229:
--------------------------------------

Hi Tim, is this project still open? I would like to participate in GSoC'19 by contributing to the Solr community.

> Allow DIH to handle attachments as separate documents
> -----------------------------------------------------
>
> Key: SOLR-7229
> URL: https://issues.apache.org/jira/browse/SOLR-7229
> Project: Solr
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Alexandre Rafalovitch
> Priority: Minor
> Labels: gsoc2017
>
> With Tika 1.7's RecursiveParserWrapper, it is possible to maintain metadata
> of individual attachments/embedded documents. Tika's default handling was to
> maintain the metadata of the container document and concatenate the contents
> of all embedded files. With SOLR-7189, we added the legacy behavior.
> It might be handy, for example, to be able to send an MSG file through DIH
> and treat the container email as well as each attachment as separate (child?)
> documents, or send a zip of jpeg files and correctly index the geo locations
> for each image file.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542670#comment-15542670 ]

Alexandre Rafalovitch commented on SOLR-7229:
---------------------------------------------

I haven't started working on this yet. I just assigned it to myself to ensure it is not lost. If you have any additional thoughts or implementation ideas, feel free to contribute.
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542313#comment-15542313 ]

Tim Allison commented on SOLR-7229:
-----------------------------------

Let me know how I can help.
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356721#comment-14356721 ]

Tim Allison commented on SOLR-7229:
-----------------------------------

Got it. To confirm, the idiomatic way to do this would be to configure the TikaEntityProcessor to create fields for latitude and longitude and then apply a combination of CloneFieldUpdateProcessorFactory and ConcatFieldUpdateProcessorFactory to do the concatenation? Is there a way to configure the concatenation without creating separate latitude and longitude fields?
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356752#comment-14356752 ]

Alexandre Rafalovitch commented on SOLR-7229:
---------------------------------------------

If you know they are lat/longs from the metadata type, can't the code just put them into one field straight away? Why bother with custom chains?
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356755#comment-14356755 ]

Alexandre Rafalovitch commented on SOLR-7229:
---------------------------------------------

Regarding the implementation, I think the parser should serve as the source for an inner entity. Maybe have a flag on the parent that controls whether the parent's metadata is passed down. Or pass it down but with a consistent prefix, so it could always be filtered out.
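To make the pass-down idea concrete, here is a minimal sketch in plain Java. It is not DIH code: the class name, the `passDown` flag, and the `parent_` prefix are all hypothetical, and metadata is modeled as simple string maps purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in DIH.
final class ParentMetadataPassDown {
    // Assumed prefix so that inherited fields can always be filtered out later.
    static final String PARENT_PREFIX = "parent_";

    private ParentMetadataPassDown() {}

    /**
     * Builds the row for one embedded document. The child's own metadata is
     * always kept; the parent's metadata is copied in under a prefix only
     * when passDown is true.
     */
    static Map<String, String> childRow(Map<String, String> parent,
                                        Map<String, String> child,
                                        boolean passDown) {
        Map<String, String> row = new LinkedHashMap<>(child);
        if (passDown) {
            for (Map.Entry<String, String> e : parent.entrySet()) {
                row.put(PARENT_PREFIX + e.getKey(), e.getValue());
            }
        }
        return row;
    }
}
```

With the prefix in place, a child field and a parent field of the same name never collide, and dropping every key that starts with the prefix recovers the child's own metadata.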
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356812#comment-14356812 ]

Tim Allison commented on SOLR-7229:
-----------------------------------

Yes, that's what I was getting at, and that was the answer I was hoping for. Apologies, I'm still trying to learn the preferences for the boundary between custom hard coding and configuration over here. I'll open another issue to add that.

And, on another note, I just noticed that the code that adds metadata is just pulling the first value; in short, if there is a multivalued Solr field, and there's more than one metadata value in the metadata object, the values after the first are being ignored. Looks like another issue. :)
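The first-value problem described above can be illustrated with a small plain-Java sketch. It does not use the Tika or DIH classes: the metadata object is modeled as a hypothetical `Map<String, List<String>>`, and both method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metadata object in which one key may hold several values.
final class MultiValuedMetadata {
    private MultiValuedMetadata() {}

    /** Mirrors the reported behavior: only the first value for the key survives. */
    static List<String> firstValueOnly(Map<String, List<String>> metadata, String key) {
        List<String> values = metadata.getOrDefault(key, Collections.emptyList());
        return values.isEmpty()
                ? Collections.emptyList()
                : Collections.singletonList(values.get(0));
    }

    /** What a multivalued field would need: every value for the key, in order. */
    static List<String> allValues(Map<String, List<String>> metadata, String key) {
        return new ArrayList<>(metadata.getOrDefault(key, Collections.emptyList()));
    }
}
```

For a key holding two values, `firstValueOnly` returns a one-element list while `allValues` returns both, which is the difference a multivalued Solr field would expose.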
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356168#comment-14356168 ]

David Smiley commented on SOLR-7229:
------------------------------------

There isn't any spatial-specific code in the DIH, and I don't think there needs to be if we're talking about lat/lon data. Simply concatenate lat,lon into one string and it'll be handled by the field type appropriately (be it LatLonType or RPT).
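As a rough sketch of the concatenation suggested above, the following plain-Java helper builds the "lat,lon" string from two numeric values. The class and method names are hypothetical, and the range check is an added assumption rather than something the thread specifies.

```java
// Hypothetical helper; not part of Solr or DIH.
final class LatLonField {
    private LatLonField() {}

    /** Joins latitude and longitude into a single "lat,lon" string. */
    static String toLatLon(double lat, double lon) {
        // Assumed sanity check: reject NaN and values outside the usual degree ranges.
        if (Double.isNaN(lat) || Double.isNaN(lon)
                || lat < -90 || lat > 90 || lon < -180 || lon > 180) {
            throw new IllegalArgumentException("invalid lat/lon: " + lat + "," + lon);
        }
        // Double-to-String conversion is locale-independent, so '.' is always the decimal separator.
        return lat + "," + lon;
    }
}
```

For example, `LatLonField.toLatLon(40.7128, -74.006)` yields `"40.7128,-74.006"`, which can then be written to a single spatial field instead of two separate latitude and longitude fields.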
[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents
[ https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356141#comment-14356141 ]

Tim Allison commented on SOLR-7229:
-----------------------------------

[~dsmiley], do you happen to know offhand whether DIH indexes lat/longs from metadata extracted by Tika? If not, that might be a separate issue. We'll want that capability to test this one.