[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2019-03-04 Thread Nazerke Seidan (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783547#comment-16783547
 ] 

Nazerke Seidan commented on SOLR-7229:
--

Hi Tim,

I was wondering whether this project is still open. I would like to 
participate in GSoC'19 by contributing to the Solr community. 

> Allow DIH to handle attachments as separate documents
> -
>
> Key: SOLR-7229
> URL: https://issues.apache.org/jira/browse/SOLR-7229
> Project: Solr
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Alexandre Rafalovitch
>Priority: Minor
>  Labels: gsoc2017
>
> With Tika 1.7's RecursiveParserWrapper, it is possible to maintain metadata 
> of individual attachments/embedded documents.  Tika's default handling was to 
> maintain the metadata of the container document and concatenate the contents 
> of all embedded files.  With SOLR-7189, we added the legacy behavior.
> It might be handy, for example, to be able to send an MSG file through DIH 
> and treat the container email as well as each attachment as separate (child?) 
> documents, or send a zip of jpeg files and correctly index the geo locations 
> for each image file.
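
To make the "separate (child?) documents" idea concrete, here is a minimal sketch in Python. It assumes the recursive parse yields a list of metadata maps with the container first and one entry per embedded file after it. The function name, the `id` scheme, and the sample field names are hypothetical; `_childDocuments_` follows Solr's JSON nested-document syntax, and whether DIH would emit documents this way is exactly what this issue leaves open.

```python
from __future__ import annotations

from typing import Any


def to_nested_docs(parsed: list[dict[str, Any]], id_prefix: str) -> dict[str, Any]:
    """Build one parent document with one child document per attachment.

    `parsed` is assumed to hold the container's metadata first, followed by
    one entry per embedded file, in the order the parser produced them.
    """
    if not parsed:
        raise ValueError("expected at least the container document")
    container, *attachments = parsed
    # Copy the container metadata and give the parent a stable id.
    parent = dict(container, id=id_prefix)
    # Each attachment keeps its own metadata and gets a derived id
    # (the "<parent>!<n>" scheme is illustrative only).
    parent["_childDocuments_"] = [
        dict(meta, id=f"{id_prefix}!{i}")
        for i, meta in enumerate(attachments, start=1)
    ]
    return parent


msg = [
    {"subject": "Trip photos"},
    {"resourceName": "a.jpg", "geo:lat": "48.85"},
    {"resourceName": "b.jpg", "geo:lat": "40.71"},
]
doc = to_nested_docs(msg, "email-1")
```

With this shape, each attachment's metadata stays on its own child document rather than being merged into the container.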



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2016-10-03 Thread Alexandre Rafalovitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542670#comment-15542670
 ] 

Alexandre Rafalovitch commented on SOLR-7229:
-

I haven't started working on this yet. Just assigned it to myself to ensure it 
is not lost. If you have any additional thoughts or implementation ideas, feel 
free to contribute.




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2016-10-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542313#comment-15542313
 ] 

Tim Allison commented on SOLR-7229:
---

Let me know how I can help.




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2015-03-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356721#comment-14356721
 ] 

Tim Allison commented on SOLR-7229:
---

Got it.  To confirm, the idiomatic way to do this would be to configure the 
TikaEntityProcessor to create fields for latitude and longitude and then apply 
a combination of CloneFieldUpdateProcessorFactory and 
ConcatFieldUpdateProcessorFactory to do the concatenation?  Is there a way to 
configure the concatenation without creating separate latitude and longitude 
fields? 




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2015-03-11 Thread Alexandre Rafalovitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356752#comment-14356752
 ] 

Alexandre Rafalovitch commented on SOLR-7229:
-

If you know they are lat/longs from the metadata type, can't the code just put 
them into one field straight away? Why bother with custom chains?




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2015-03-11 Thread Alexandre Rafalovitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356755#comment-14356755
 ] 

Alexandre Rafalovitch commented on SOLR-7229:
-

Regarding the implementation, I think the parser should serve as a source for 
an inner entity. Maybe have a flag on the parent to control whether the 
parent's metadata is passed down. Or pass it down with a consistent prefix, so 
it can always be filtered out.
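
The flag-or-prefix idea could be sketched as follows. This is an illustration in Python rather than DIH code; the `parent_` prefix, the `pass_down` flag, and both function names are hypothetical and do not correspond to any existing DIH option.

```python
PARENT_PREFIX = "parent_"  # hypothetical prefix, not an existing DIH setting


def inherit_parent_metadata(child, parent, pass_down=True, prefix=PARENT_PREFIX):
    """Return a copy of `child` with the parent's metadata optionally added.

    Parent keys are stored under `prefix` so they never collide with the
    child's own fields and can be recognised later.
    """
    merged = dict(child)
    if pass_down:
        for key, value in parent.items():
            merged[prefix + key] = value
    return merged


def strip_parent_metadata(doc, prefix=PARENT_PREFIX):
    """Drop every field that was inherited from the parent."""
    return {k: v for k, v in doc.items() if not k.startswith(prefix)}


email = {"subject": "Trip photos", "from": "tim@example.org"}
photo = {"resourceName": "a.jpg", "subject": "Eiffel Tower"}

with_parent = inherit_parent_metadata(photo, email)
without_parent = strip_parent_metadata(with_parent)
```

Because inherited keys carry the prefix, the attachment's own `subject` is preserved next to `parent_subject`, and filtering by prefix recovers the attachment's original metadata.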




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2015-03-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356812#comment-14356812
 ] 

Tim Allison commented on SOLR-7229:
---

Y, that's what I was getting at, and that was the answer I was hoping for.  
Apologies, I'm still trying to learn the preferences for the boundary between 
custom hard coding and configuration over here.  I'll open another issue to add 
that.  

And, on another note, I just noticed that the code that adds metadata is just 
pulling the first value; in short, if there is a multivalued Solr field, and 
there's more than one metadata value in the metadata object, the values after 
the first are being ignored.  Looks like another issue. :)
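
The first-value problem can be illustrated with a small Python sketch. It models the metadata object as a plain mapping from name to a list of values, which is an assumption made for illustration; the function names are hypothetical and this is not the DIH code itself.

```python
def first_value_only(metadata, name):
    """Mimic the behaviour described above: keep only the first value."""
    values = metadata.get(name, [])
    return values[0] if values else None


def values_for_field(metadata, name, multivalued):
    """Keep every value when the target Solr field is multivalued."""
    values = metadata.get(name, [])
    if not values:
        return None
    return list(values) if multivalued else values[0]


metadata = {"dc:creator": ["Alice", "Bob"], "dc:title": ["Report"]}

lossy = first_value_only(metadata, "dc:creator")
complete = values_for_field(metadata, "dc:creator", multivalued=True)
```

In this sketch `lossy` holds only the first creator, while `complete` holds both; single-valued fields behave the same either way.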




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2015-03-10 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356168#comment-14356168
 ] 

David Smiley commented on SOLR-7229:


There isn't anything spatial-specific in the DIH, and I don't think there needs 
to be if we're talking lat/lon data.  Simply concatenate lat,lon into one string 
and it'll be handled by the field type appropriately (be it LatLonType or RPT).
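
A minimal sketch of that concatenation, in Python for illustration. The metadata keys `geo:lat` and `geo:long` are assumed here to be the names under which the latitude and longitude arrive; the function name and the range check are additions of this sketch, not part of DIH.

```python
def lat_lon_value(metadata, lat_key="geo:lat", lon_key="geo:long"):
    """Return a "lat,lon" string for a spatial field, or None if incomplete.

    The key names are assumptions; pass different ones if the extracted
    metadata uses other names.
    """
    lat = metadata.get(lat_key)
    lon = metadata.get(lon_key)
    if lat is None or lon is None:
        return None
    lat_text, lon_text = str(lat).strip(), str(lon).strip()
    # Parse only to validate; the original text is what gets indexed.
    lat_num, lon_num = float(lat_text), float(lon_text)
    if not (-90.0 <= lat_num <= 90.0 and -180.0 <= lon_num <= 180.0):
        raise ValueError(f"coordinates out of range: {lat_text},{lon_text}")
    return f"{lat_text},{lon_text}"


point = lat_lon_value({"geo:lat": "48.8584", "geo:long": "2.2945"})
```

The resulting string can be sent to a single spatial field, so no separate latitude and longitude fields are needed in the schema.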




[jira] [Commented] (SOLR-7229) Allow DIH to handle attachments as separate documents

2015-03-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356141#comment-14356141
 ] 

Tim Allison commented on SOLR-7229:
---

[~dsmiley], do you happen to know offhand whether DIH indexes lat/longs from 
metadata extracted by Tika?  If not, that might be a separate issue.  We'll 
want that capability to test this one.
