[ 
https://issues.apache.org/jira/browse/CONNECTORS-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138630#comment-14138630
 ] 

Karl Wright commented on CONNECTORS-956:
----------------------------------------

I did some research as to what happens in SolrJ right at the moment.

The key method is SolrServer.request(ContentStreamUpdateRequest cs).  In the 
non-Solr-Cloud case, requests go through ModifiedHttpSolrServer, our subclass 
of the SolrJ class org.apache.solr.client.solrj.impl.HttpSolrServer, which 
overrides this method to fix other bugs.  What it does for Get, Post, and 
multipart Post is as follows:

Get:
{code}
             method = new HttpGet( baseUrl + path + ClientUtils.toQueryString( params, false ) );
{code}
Post:
{code}
                    if (isMultipart) {
                      parts.add(new FormBodyPart(p, new StringBody(v, StandardCharsets.UTF_8)));
                    } else {
                      postParams.add(new BasicNameValuePair(p, v));
                    }
{code}
Multipart:
{code}
                post.setEntity(new UrlEncodedFormEntity(postParams, StandardCharsets.UTF_8));
                ModifiedMultipartEntity entity = new ModifiedMultipartEntity(HttpMultipartMode.STRICT, null, StandardCharsets.UTF_8);
                for(FormBodyPart p: parts) {
                  entity.addPart(p);
                }
                post.setEntity(entity);
{code}
Not multipart:
{code}
                post.setEntity(new UrlEncodedFormEntity(postParams, StandardCharsets.UTF_8));
{code}

I believe multipart post and post are therefore safe against illegal parameter 
name characters.  However, ClientUtils.toQueryString( params, false ) is NOT 
safe:

{code}
public static String toQueryString( SolrParams params, boolean xml ) {
    StringBuilder sb = new StringBuilder(128);
    try {
      String amp = xml ? "&amp;" : "&";
      boolean first=true;
      Iterator<String> names = params.getParameterNamesIterator();
      while( names.hasNext() ) {
        String key = names.next();
        String[] valarr = params.getParams( key );
        if( valarr == null ) {
          sb.append( first?"?":amp );
          sb.append(key);
          first=false;
        }
        else {
          for (String val : valarr) {
            sb.append( first? "?":amp );
            sb.append(key);
            if( val != null ) {
              sb.append('=');
              sb.append( URLEncoder.encode( val, "UTF-8" ) );
            }
            first=false;
          }
        }
      }
    }
    catch (IOException e) {throw new RuntimeException(e);}  // can't happen
    return sb.toString();
  }
{code}
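
To make the gap concrete, here is a minimal, self-contained sketch. It is not SolrJ code: the class name QueryStringSketch, its method names, and the sample parameter value are hypothetical, and a plain Map stands in for SolrParams. It reproduces the key handling of the method above (value encoded, key appended raw) next to a variant that also encodes the key, which is one possible way the problem could be addressed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; a Map stands in for SolrParams.
class QueryStringSketch {

  // Same key handling as the method quoted above: values are encoded, keys are not.
  public static String unsafeQueryString(Map<String, String[]> params) {
    return build(params, false);
  }

  // Assumed fix: URL-encode the key as well as the value.
  public static String safeQueryString(Map<String, String[]> params) {
    return build(params, true);
  }

  private static String build(Map<String, String[]> params, boolean encodeKeys) {
    StringBuilder sb = new StringBuilder(128);
    boolean first = true;
    for (Map.Entry<String, String[]> entry : params.entrySet()) {
      String key = encodeKeys ? encode(entry.getKey()) : entry.getKey();
      String[] values = entry.getValue();
      if (values == null) {
        // A key with no value array is emitted once, without '='.
        values = new String[] { null };
      }
      for (String val : values) {
        sb.append(first ? '?' : '&');
        sb.append(key);
        if (val != null) {
          sb.append('=').append(encode(val));
        }
        first = false;
      }
    }
    return sb.toString();
  }

  private static String encode(String s) {
    try {
      return URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }

  public static void main(String[] args) {
    Map<String, String[]> params = new LinkedHashMap<>();
    // Field name taken from the issue description; the value is made up.
    params.put("{http://www.alfresco.org/model/system}store_identifier",
               new String[] { "example" });
    System.out.println(unsafeQueryString(params));
    System.out.println(safeQueryString(params));
  }
}
```

With the unencoded variant, characters such as '{', ':', '/', '&' and '=' in a field name end up verbatim in the URL, so a name containing '&' or '=' would be split into different parameters on the server side. The encoding variant keeps the name intact as a single key.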

I can't override that method, because it's static and multiple places call 
it.  The best I can do is override the solr server classes that make use of it.  
That may or may not work; the derivation of (say) 
org.apache.solr.client.solrj.impl.CloudSolrServer is complex.  The concern is 
that we don't control that flow, for the most part, although posts, gets, and 
multipart posts *do* still go through our ModifiedHttpSolrServer class.

What I propose to do is to break backwards compatibility in trunk, since it's 
ManifoldCF 2.0 anyway and that is allowed.  If the change seems to work there, 
we can talk about adding a switch in the dev_1x branch.


> Field names are URL encoded
> ---------------------------
>
>                 Key: CONNECTORS-956
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-956
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.6.1
>            Reporter: Piergiorgio Lucidi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The field names provided by some repositories such as Alfresco are based on 
> a URI similar to:
> {code}
> {http://www.alfresco.org/model/system}store_identifier
> {code}
> But in Solr we found the following field name:
> {code}
> http_3a_2f_2fwww_alfresco_org_2fmodel_2fsystem_2f1_0_7dstore_identifier
> {code}
> The code involved in the Solr connector is the following:
> {code}
> protected static String preEncode(String fieldName)
>   {
>       return URLEncoder.encode(fieldName);
>   }
> {code}
> Probably we should try to solve it by removing the preEncode invocation.



