[jira] [Commented] (CONNECTORS-916) Amazon CloudSearch output connector

Karl Wright (JIRA) Mon, 19 May 2014 10:43:35 -0700

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002049#comment-14002049
 ]


Karl Wright commented on CONNECTORS-916:
----------------------------------------

Hi Takumi,

I see a couple of issues with this patch.

(1) There is already a poi download; you can't download two different versions 
of poi to the same place and have anything sensible happen.
(2) The patch includes a large number of new jars.  Somebody needs to do 
research to find out with the license is for each of these jars, and they will 
need to be mentioned in the appropriate license file.  If any of the jars is 
not an Apache-approved license, something else will need to be done.  If you 
can summarize the jars being added and their licenses, I can write the actual 
license update.
(3) There's some nonsense code in the build file I don't understand, for 
example:

{code}
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
{code}

(4) I see a lot of diffs of this kind, which don't appear to be changing 
anything.  What is happening here?

{code}
-  {
-    // Establish a session
+  {
+    // Establish a session
     getSession();
     
     SpecPacker sp = new SpecPacker(outputDescription);
     
-    String jsondata = "";
+    String jsondata = "";
{code}

(5) I have some concerns about this:

{code}
+      // generate document id from document URI using md5 checksum.
+      // document id must be shorter than 128 characters because of 
AmazonCloudSearch.
+      try {
+        doc.setId(URLEncoder.encode(generateid(documentURI),"utf-8"));
+      } catch (NoSuchAlgorithmException e) {
+        throw new ManifoldCFException(e);
+      }
+      
{code}

Generally, since hashes have a possibility of colliding, it is better to send 
the URI as metadata, and then look up the document identifier by searching for 
the document metadata URI.  You can create a document identifier then only if 
you don't find the matching record.  If that is not possible, I'd still 
structure the code so that there's a single place where the hashing is done.  
You might also look at and use:

http://manifoldcf.apache.org/release/trunk/api/framework/org/apache/manifoldcf/core/system/ManifoldCF.html#encrypt%28java.lang.String%29




Why does this need to be included 5 times? xercesImpl is also included twice 
too.




> Amazon CloudSearch output connector
> -----------------------------------
>
>                 Key: CONNECTORS-916
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-916
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Takumi Yoshida
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: 0507.diff, 0520.diff, 1.patch, 2.diff, 3.diff, 
> AmazonCloudSearchParam.java, AmazonCloudSearchSpecs.java, 
> exception_handling.diff, exception_handling_2.diff
>
>
> I wrote some codes snipetts of output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl web site and feed HTML page 
> to Amazon CloudSearch.
> but it is not perfectly completed followoing reason.
> - does not write any codes for configuration page.
> - supporting file type is only HTML
> Thank you for your time,
>  Takumi Yoshida



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CONNECTORS-916) Amazon CloudSearch output connector

Reply via email to