David – can you advise if there is anything which can be done to speed up 
indexing? Are there any config parameters I could use to tweak the performance?



Joerg – the entire indexing looks completely differently. If I do not use the 
mapper attachment, the size of the index and the number of indexed documents 
grow at the same time. With mapper attachments, however, the size of the index 
grows with the number of indexed documents staying at 0 until the entire index 
is being built.





From: [email protected] [mailto:[email protected]] On 
Behalf Of [email protected]
Sent: Monday, February 23, 2015 5:13 PM
To: [email protected]
Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



I am not sure but it looks like mapper attachment is doing some extra 
processing, for example Tika, which is very expensive. Maybe there is some 
configuration option, I did not check.



Jörg



On Mon, Feb 23, 2015 at 2:13 PM, Jiri Pik 
<[email protected]<mailto:[email protected]>> wrote:

   The table has 3000 rows, the index is defined as below



   {

   {

   "index": {

   "primary_size_in_bytes": 296341451,

   "size_in_bytes": 296341451

   },

   "translog": {

   "operations": 0

   },

   "docs": {

   "num_docs": 3000,

   "max_doc": 3000,

   "deleted_docs": 0

   },},





   I believe it’s the mapper attachment who is causing this delay.



   David – is there any way to speed this up?



   From: [email protected]<mailto:[email protected]> 
[mailto:[email protected]<mailto:[email protected]>] 
On Behalf Of [email protected]<mailto:[email protected]>
   Sent: Monday, February 23, 2015 10:15 AM


   To: [email protected]<mailto:[email protected]>
   Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



   How big is the entire table you index?



   You can use monitor tools like BigDesk to verify the resources ES is using.



   It is close to impossible that just base64 encoding takes 20x longer while 
indexing, maybe mapper attachment is doing other extra work.



   Jörg



   On Mon, Feb 23, 2015 at 9:50 AM, Jiri Pik 
<[email protected]<mailto:[email protected]>> wrote:

      Thank you for opening of the issue.



      If I indexed the column as varchar and used the default ES indexing, the 
entire table is indexed within 5 seconds. If I use the Mapper Attachments, it 
takes up to 2 minutes. I am not sure whether it’s because of the extra work SQL 
Server is doing, or the extra volume the jdbc is taking care, but I assume it 
may be because of the way the Mapper Attachments works?







      From: 
[email protected]<mailto:[email protected]> 
[mailto:[email protected]<mailto:[email protected]>] 
On Behalf Of [email protected]<mailto:[email protected]>
      Sent: Monday, February 23, 2015 9:26 AM
      To: [email protected]<mailto:[email protected]>
      Subject: Re: FW: Indexing of HTML Column in an MS SQL Server 2014 database



      1. I opened an issue for adding optional base64 encoding on columns: 
https://github.com/jprante/elasticsearch-river-jdbc/issues/472



      2. What is "initial indexing"? What do you mean by "slower"?



      3. Yes, you can change the documented bulk index settings.



      Jörg





      On Mon, Feb 23, 2015 at 6:12 AM, Jiri Pik 
<[email protected]<mailto:[email protected]>> wrote:

         Apologies for everyone for sending these emails with digital signature 
which may have caused some issues:



         Summary for Joerg:



         1.       Is there a way for the JDBC river to transform the 
nvarchar(MAX) into Base64 by itself? I can do on SQL server – see below (1) for 
David – but it’s substantially slower

         2.       If not, do you recommend nvarbinary(MAX) or some other MS SQL 
Server type? And then the SELECT * from XXX would just work?



         Summary for David:

         1.       If I convert the HTML column using select ID, cast(N'' as 
xml).value ('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 
'varchar(max)') as Content from (SELECT ID ,  cast( cast(Content as varchar(MAX 
)) as varbinary( MAX)) Content from KBArticles) k; the indexing just works but 
takes longer than usual – is there any performance setting I could use?

         2.       Would it be possible for the attachment mapper to index pure 
txt file without base64?













         From: Jiri Pik
         Sent: Monday, February 23, 2015 6:08 AM
         To: 
[email protected]<mailto:[email protected]>
         Subject: RE: Indexing of HTML Column in an MS SQL Server 2014 database



         Thank you very much for your kind answer. If I encode the html file 
into Base64, and use the enclosed script, then all works just fine.



         So, Joerg:



         1.       Is there a way for the JDBC river to transform the 
nvarchar(MAX) into Base64 by itself?

         2.       If not, do you recommend nvarbinary(MAX) or some other MS SQL 
Server type? And then the SELECT * from XXX would just work?



         What are your thoughts?



         BTW I have been able to convert the nvarchar to base64 using this query

         select ID, cast(N'' as xml).value 
('xs:base64Binary(xs:hexBinary(sql:column("k.Content")))', 'varchar(max)') as 
Content from (SELECT ID ,  cast( cast(Content as varchar(MAX )) as varbinary( 
MAX)) Content from KBArticles) k;





         The usual river and mapper attachment work just fine but the initial 
indexing takes substantially longer. Why?



         3.       Is there any performance settings I could tweak?



         From: 
[email protected]<mailto:[email protected]> 
[mailto:[email protected]] On Behalf Of 
[email protected]<mailto:[email protected]>
         Sent: Sunday, February 22, 2015 6:12 PM
         To: 
[email protected]<mailto:[email protected]>
         Subject: Re: Indexing of HTML Column in an MS SQL Server 2014 database



         Can you give some information about the mapper attachment setup you 
used successfully?



         There is no good reason why this should not be possible with JDBC 
river.



         Jörg



         On Sun, Feb 22, 2015 at 5:20 PM, Jiri Pik 
<[email protected]<mailto:[email protected]>> wrote:

            I need to index a HTML column (nvarchar(MAX)) in a MS SQL Server 
database. I have set up a JDBC river 
https://github.com/jprante/elasticsearch-river-jdbc and the database is indexed.

            Using

              "settings":{

                "analysis":{

                  "analyzer":{

                    "default":{

                      "type":"custom",

                      "tokenizer":"standard",

                      "filter":[ "standard", "lowercase" ],

                      "char_filter" : ["html_strip"]

                    }

                  }

                }

              }

            is good for searching but not for the highlighter as that returns 
sometimes trimmed unpaired html tags.

            I have played with the Mapper Attachments with HTML attachments and 
then the highlighter works well - all original html tags are gone - but I am 
unable to get the river push the column directly to the Mapper Attachments.

            Questions:

            1. what is the best practice for indexing HTML columns? I am aware 
of the possibility of a manual removal of HTML tags using Agility Pack but do 
not like that as it's too much extra maintenance.

            2. is there any better highlighter for html data which doesn't cut 
off any original html tags?

            3. How to plug in the JDBC river to Mapper Attachments?

            4. Any better ideas how to achieve my goals?



            Thanks!

            --
            You received this message because you are subscribed to the Google 
Groups "elasticsearch" group.
            To unsubscribe from this group and stop receiving emails from it, 
send an email to 
[email protected]<mailto:[email protected]>.
            To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/f175734b-0889-40a9-96d1-d46702e56666%40googlegroups.com?utm_medium=email&utm_source=footer>.
            For more options, visit https://groups.google.com/d/optout.



         --
         You received this message because you are subscribed to the Google 
Groups "elasticsearch" group.
         To unsubscribe from this group and stop receiving emails from it, send 
an email to 
[email protected]<mailto:[email protected]>.

         To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoH6Ei%2B23bRKrL0Z7WkQALengfhaZeJRBq5gK1F22yxJfg%40mail.gmail.com?utm_medium=email&utm_source=footer>.
         For more options, visit https://groups.google.com/d/optout.

         --
         You received this message because you are subscribed to the Google 
Groups "elasticsearch" group.
         To unsubscribe from this group and stop receiving emails from it, send 
an email to 
[email protected]<mailto:[email protected]>.
         To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/a5258a9fb35548b186333e442238331c%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.


         For more options, visit https://groups.google.com/d/optout.



      --
      You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
      To unsubscribe from this group and stop receiving emails from it, send an 
email to 
[email protected]<mailto:[email protected]>.

      To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaJKN9Q5Rsu8XqLpEWafyPK_YBA7rGvMX7R-9T4Odiuw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
      For more options, visit https://groups.google.com/d/optout.

      --
      You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
      To unsubscribe from this group and stop receiving emails from it, send an 
email to 
[email protected]<mailto:[email protected]>.
      To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/a9c9114b28384485b3f4d6290d5a2da0%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.


      For more options, visit https://groups.google.com/d/optout.



   --
   You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
   To unsubscribe from this group and stop receiving emails from it, send an 
email to 
[email protected]<mailto:[email protected]>.

   To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHw3oba_%3DAGAnYofoeHY%3Dx5JDwdSPmRhEcPdmMkHUEQwQ%40mail.gmail.com?utm_medium=email&utm_source=footer>.
   For more options, visit https://groups.google.com/d/optout.

   --
   You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
   To unsubscribe from this group and stop receiving emails from it, send an 
email to 
[email protected]<mailto:[email protected]>.
   To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net<https://groups.google.com/d/msgid/elasticsearch/eb5ebcabc33a4e82a726a936733fdd28%40Ex13DAG10-N1.dataoncloud.net?utm_medium=email&utm_source=footer>.


   For more options, visit https://groups.google.com/d/optout.



   --
   You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
   To unsubscribe from this group and stop receiving emails from it, send an 
email to 
[email protected]<mailto:[email protected]>.
   To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.com<https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMyR-nMK%3Dymrec%2B3KF1nySUNJw7%2BSR0KcxHsxq7CrKdQ%40mail.gmail.com?utm_medium=email&utm_source=footer>.
   For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/33a3cd982b1743e49520143e773600db%40Ex13DAG10-N1.dataoncloud.net.
For more options, visit https://groups.google.com/d/optout.

Reply via email to