docs handler needs to do a better job with tweet like JSON structures

Timothy Potter (JIRA) Mon, 13 Oct 2014 11:51:49 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169749#comment-14169749
 ]


Timothy Potter commented on SOLR-6617:
--------------------------------------

Patch looks good [~noble.paul]. I applied this to my test scenario:

{code}
curl "http://localhost:8983/solr/tutorial/update/json/docs" -H 'Content-type:application/json' -d @sample_tweet.json
{code}

Resulted in:

{code}
{
        "user.name": [
          "Stewart Townsend"
        ],
        "user.url": [
          "http://www.stewarttownsend.com"
        ],
        "user.description": [
          "Developer Relations at Datasift (www.datasift.com)  - Car racing petrol head, all things social lover, co-founder of www.flowerytweetup.com"
        ],
        "user.location": [
          "iPhone: 53.852402,-2.220047"
        ],
        "user.statuses_count": [
          28247
        ],
        "user.followers_count": [
          3094
        ],
        "user.friends_count": [
          510
        ],
        "user.screen_name": [
          "stewarttownsend"
        ],
        "user.lang": [
          "en"
        ],
        "user.time_zone": [
          "London"
        ],
        "user.listed_count": [
          221
        ],
        "user.id": [
          14065694
        ],
        "user.id_str": [
          14065694
        ],
        "user.geo_enabled": [
          true
        ],
        "id": "136447843652214784",
        "text": [
          "Morning San Francisco - 36 hours and counting.. #datasift"
        ],
        "source": [
          "<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>"
        ],
        "created_at": [
          "Tue, 15 Nov 2011 14:17:55 +0000"
        ],
        "_version_": 1481875073806631000
      }
{code}

Which I'd say is very reasonable behavior on Solr's part.  +1 for commit
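
To illustrate why dotted field names avoid the uniqueKey clash reported in the description quoted below, here is a minimal Python sketch. It is not Solr's implementation: `by_leaf_name` and `by_dotted_path` are hypothetical helpers, and unlike the output above the sketch treats every field as multi-valued, including "id". It only contrasts grouping values by leaf key with grouping them by full path.

```python
import json


def by_leaf_name(doc, fields=None):
    """Group values by leaf key only, ignoring nesting (hypothetical helper)."""
    fields = {} if fields is None else fields
    for key, value in doc.items():
        if isinstance(value, dict):
            # Descend into the nested object but keep using bare key names.
            by_leaf_name(value, fields)
        else:
            fields.setdefault(key, []).append(value)
    return fields


def by_dotted_path(doc, prefix="", fields=None):
    """Group values by dotted path, e.g. user.id (hypothetical helper)."""
    fields = {} if fields is None else fields
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend and carry the parent name along as a prefix.
            by_dotted_path(value, name, fields)
        else:
            fields.setdefault(name, []).append(value)
    return fields


# Trimmed copy of the sample tweet from the issue description.
tweet = json.loads("""
{
  "user": {"screen_name": "example", "id": 14065694},
  "id": "136447843652214784",
  "text": "Morning San Francisco - 36 hours and counting.. #datasift"
}
""")

leaf = by_leaf_name(tweet)
dotted = by_dotted_path(tweet)

print("leaf-name grouping, id:", leaf["id"])
print("dotted-path grouping, id:", dotted["id"])
print("dotted-path grouping, user.id:", dotted["user.id"])
```

With leaf-name grouping, both ids end up under "id", which matches the "multiple values for uniqueKey field" error. With dotted paths, the nested value lands under "user.id" and the top-level "id" stays single.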

> /update/json/docs handler needs to do a better job with tweet like JSON 
> structures
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-6617
>                 URL: https://issues.apache.org/jira/browse/SOLR-6617
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Timothy Potter
>            Assignee: Noble Paul
>         Attachments: SOLR-6617.patch
>
>
> SOLR-6304 allows me to send in an arbitrary JSON document and have Solr do 
> something reasonable with it. I tried this with a simple tweet and got a 
> weird error:
> {code}
> curl "http://localhost:8983/solr/tutorial/update/json/docs" -H 'Content-type:application/json' -d @sample_tweet.json
> {"responseHeader":{"status":400,"QTime":11},"error":{"msg":"Document contains multiple values for uniqueKey field: id=[14065694, 136447843652214784]","code":400}}
> {code}
> Here's the tweet I'm trying to index:
> {code}
> {
>         "user": {
>             "name": "John Doe",
>             "screen_name": "example",
>             "lang": "en",
>             "time_zone": "London",
>             "listed_count": 221,
>             "id": 14065694,
>             "geo_enabled": true
>         },
>         "id": "136447843652214784",
>         "text": "Morning San Francisco - 36 hours and counting.. #datasift",
>         "created_at": "Tue, 15 Nov 2011 14:17:55 +0000"
> }
> {code}
> The error is because the nested user object within the tweet also has an "id" 
> field. So then I tried to map /user/id to user_id_s via:
> {code}
> curl "http://localhost:8983/solr/tutorial/update/json/docs?f=user_id_s:/user/id" -H 'Content-type:application/json' -d @sample_tweet.json
> {"responseHeader":{"status":400,"QTime":0},"error":{"msg":"Document is missing mandatory uniqueKey field: id","code":400}}
> {code}
> So then I added the mapping for id explicitly and it worked:
> curl "http://localhost:8983/solr/tutorial/update/json/docs?f=id:/id&f=user_id_s:/user/id" -H 'Content-type:application/json' -d @sample_tweet.json
> {"responseHeader":{"status":0,"QTime":25}}
> Working through this wasn't terrible but our goal with features like this is 
> to have Solr make good decisions when possible to ease the new user's burden 
> of getting to know Solr.
> I'm just wondering if the reasonable thing to do wouldn't be to map the user 
> fields with a user_ prefix, i.e., /user/id becomes user_id automatically.
> Lastly, I wanted to use field guessing with this so that my JSON document gets 
> indexed in a reasonable way, but the only data that got indexed is:
> {code}
> {
>         "user_id_s": "14065694",
>         "id": "136447843652214784",
>         "_version_": 1481614081193410600
> }
> {code}
> So I explicitly defined the /update/json/docs request handler in my 
> solrconfig.xml as:
> {code}
>   <requestHandler name="/update/json/docs" class="solr.UpdateRequestHandler">
>         <lst name="defaults">
>          <str name="update.chain">add-unknown-fields-to-the-schema</str>
>          <str name="stream.contentType">application/json</str>
>        </lst>
>   </requestHandler>
> {code}
> Same result - no field guessing! (this is using the schemaless example config)


