I'm going to try to fix the running-out-of-space issue (my planned path.data
change is sketched just after the code below) and then try slimming down my
settings. Thank you.

This is the code I've been using to index:
// (imports restored for completeness; DBConnection, PropertyGeneralInfoRow,
// and geocode are my own classes, and the logger assumes log4j 1.x)
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

import com.fasterxml.jackson.databind.ObjectMapper;

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class Indexer {

    private static final Logger logger = LogManager.getLogger("ESBulkUploader");

    public static void main(String[] args) throws IOException, NoSuchFieldException {
        DBConnection dbConn = new DBConnection("");
        String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo "
                + "WHERE Country_id = 1 ORDER BY Property_id DESC";

        System.out.println("getting data");
        List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
        System.out.println("got data");

        ObjectMapper mapper = new ObjectMapper();
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "property_transaction_data")
                .build();
        Client client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("192.168.133.131", 9300));

        BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("About to index " + request.numberOfActions()
                        + " records of size " + request.estimatedSizeInBytes() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // log every per-item failure ES reports for the batch
                if (response.hasFailures()) {
                    for (BulkItemResponse item : response.getItems()) {
                        BulkItemResponse.Failure failure = item.getFailure();
                        if (failure != null) {
                            System.out.println(failure.getId() + " -- "
                                    + failure.getStatus().name() + " -- "
                                    + failure.getMessage() + " -- " + failure.getType());
                        }
                    }
                }
                System.out.println("Successfully indexed " + request.numberOfActions()
                        + " records in " + response.getTook() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // the whole bulk request failed, not individual items
                System.out.println("failure somewhere on " + request.toString());
                failure.printStackTrace();
                logger.warn("failure on " + request.toString());
            }
        }).setBulkActions(500).setConcurrentRequests(1).build();

        for (int i = 0; i < pgiTable.size(); i++) {
            // prep location field ([lon, lat] order)
            PropertyGeneralInfoRow pgiRow = pgiTable.get(i);
            Double[] location = {pgiRow.getLon_dbl(), pgiRow.getLat_dbl()};
            geocode geocode = new geocode();
            geocode.setLocation(location);
            pgiRow.setGeocode(geocode);

            // prep full address string
            pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", "
                    + pgiRow.getCity_tx() + ", " + pgiRow.getStateprov_cd() + ", "
                    + pgiRow.getCountry_tx() + ", " + pgiRow.getPostalcode_tx());

            String jsonRow = mapper.writeValueAsString(pgiRow);
            if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
                // encode explicitly as UTF-8 rather than the platform default
                bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty")
                        .source(jsonRow.getBytes(StandardCharsets.UTF_8)));
                // bulkProcessor.add(client.prepareIndex("rcapropertydata",
                //         "rcaproperty").setSource(jsonRow));
            } else {
                // don't add null strings..
                try {
                    System.out.println(pgiRow.toString());
                } catch (Exception e) {
                    System.out.println("Some error in the toString() method...");
                }
                System.out.println("Some json output was null. -- "
                        + pgiRow.getProperty_id().toString());
            }
        }

        // push out any partial batch, then release the processor's resources
        bulkProcessor.flush();
        bulkProcessor.close();
    }
}
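
For the disk-space problem, this is the change I'm planning in
config/elasticsearch.yml, per Jörg's pointer below (the mount point is just an
example for my VM, not something from this thread):

    # write index data to the partition that actually has free space
    path.data: /mnt/esdata

If I'm reading the init script correctly, uncommenting and setting DATA_DIR in
/etc/default/elasticsearch should have the same effect, since the script passes
it through to Elasticsearch.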
On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:
>
> Check the path.data setting in config/elasticsearch.yml
>
> Jörg
>
> On Tue, Sep 9, 2014 at 7:50 PM, Joshua P <[email protected]> wrote:
>
>> Just reran the indexer and found this error coming up. I'm running out of
>> disk space on the partition ES wants to write to.
>>
>> F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
>> TranslogException[[index_type][0] Failed to write operation
>> [org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
>> IOException[No space left on device]; -- index_type
>>
>> Where would I change the write location? Which config file?
>>
>> On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:
>>>
>>> Hi Jörg,
>>>
>>> Can you elaborate on what you mean when you say I still need more
>>> fine-tuning?
>>>
>>> I've upped the heap size to 4g (in both places I mentioned before,
>>> because it's not clear to me which one ES actually uses). I haven't tried
>>> to index again yet.
>>> Other than throttling my indexing, what are some other things I need to
>>> be thinking about?
>>>
>>> On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:
>>>>
>>>> Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and
>>>> indexing around 1 million docs, you need some more fine-tuning, which is
>>>> complicated. Your machine is fine for setting the heap to 4 GB, which is
>>>> 50% of the 8 GB RAM.
>>>>
>>>> Jörg
>>>>
>>>> On Tue, Sep 9, 2014 at 5:39 PM, Joshua P <[email protected]> wrote:
>>>>
>>>>> Here is /etc/default/elasticsearch
>>>>>
>>>>> # Run Elasticsearch as this user ID and group ID
>>>>> #ES_USER=elasticsearch
>>>>> #ES_GROUP=elasticsearch
>>>>>
>>>>> # Heap Size (defaults to 256m min, 1g max)
>>>>> ES_HEAP_SIZE=512m
>>>>>
>>>>> # Heap new generation
>>>>> #ES_HEAP_NEWSIZE=
>>>>>
>>>>> # max direct memory
>>>>> #ES_DIRECT_SIZE=
>>>>>
>>>>> # Maximum number of open files, defaults to 65535.
>>>>> MAX_OPEN_FILES=65535
>>>>>
>>>>> # Maximum locked memory size. Set to "unlimited" if you use the
>>>>> # bootstrap.mlockall option in elasticsearch.yml. You must also set
>>>>> # ES_HEAP_SIZE.
>>>>> MAX_LOCKED_MEMORY=unlimited
>>>>>
>>>>> # Maximum number of VMA (Virtual Memory Areas) a process can own
>>>>> #MAX_MAP_COUNT=262144
>>>>>
>>>>> # Elasticsearch log directory
>>>>> #LOG_DIR=/var/log/elasticsearch
>>>>>
>>>>> # Elasticsearch data directory
>>>>> #DATA_DIR=/var/lib/elasticsearch
>>>>>
>>>>> # Elasticsearch work directory
>>>>> #WORK_DIR=/tmp/elasticsearch
>>>>>
>>>>> # Elasticsearch configuration directory
>>>>> #CONF_DIR=/etc/elasticsearch
>>>>>
>>>>> # Elasticsearch configuration file (elasticsearch.yml)
>>>>> #CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>>>
>>>>> # Additional Java OPTS
>>>>> #ES_JAVA_OPTS=
>>>>>
>>>>> # Configure restart on package upgrade (true, every other setting will
>>>>> lead to not restarting)
>>>>> #RESTART_ON_UPGRADE=true
>>>>>
>>>>> I also see the same setting in /etc/init.d/elasticsearch. Do you know
>>>>> which file takes priority? And what a good size would be?
>>>>>
>>>>> On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:
>>>>>>
>>>>>> Hello Joshua ,
>>>>>>
>>>>>> I am not sure which variable you are referring to in the memory
>>>>>> settings in the config file; please paste the comment and the config.
>>>>>> I usually change the config from the init.d script.
>>>>>>
>>>>>> The best approach would be to bulk index, say, 10,000 feeds in sync
>>>>>> mode, wait until everything is indexed, and then proceed to the next
>>>>>> batch.
>>>>>> I am not sure about the Java API, but a while back I used to curl this
>>>>>> stats API to see how many requests were rejected.
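
If I try this from the Java API, I think it would look roughly like the sketch
below — the 10,000 batch size is the example from above, the index/type names
are from my code, and the rest is my own guess at what "sync mode" means here:

    // Sketch: build batches of 10,000 docs and send each bulk request
    // synchronously, waiting for the response before starting the next batch.
    BulkRequestBuilder bulk = client.prepareBulk();
    for (int i = 0; i < pgiTable.size(); i++) {
        String json = mapper.writeValueAsString(pgiTable.get(i));
        bulk.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(json));
        boolean lastRow = (i == pgiTable.size() - 1);
        if (bulk.numberOfActions() >= 10000 || lastRow) {
            BulkResponse response = bulk.execute().actionGet(); // blocks until ES acks the batch
            if (response.hasFailures()) {
                System.out.println(response.buildFailureMessage());
            }
            bulk = client.prepareBulk(); // start a fresh batch
        }
    }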
>>>>>>
>>>>>> Thanks
>>>>>> Vineeth
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 8:58 PM, Joshua P <[email protected]> wrote:
>>>>>>
>>>>>>> You also said you wouldn't recommend indexing that much information
>>>>>>> at once. How would you suggest breaking it up, and what status should
>>>>>>> I look for before doing another batch? I have to come up with a
>>>>>>> process that is repeatable and mostly automated.
>>>>>>>
>>>>>>> On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:
>>>>>>>>
>>>>>>>> Thanks for the reply, Vineeth!
>>>>>>>>
>>>>>>>> What's a practical heap size? I've seen some people say they set it
>>>>>>>> to 30gb, but this confuses me because the comment in the
>>>>>>>> /etc/default/elasticsearch file suggests the max is only 1gb.
>>>>>>>>
>>>>>>>> I'll look into the threadpool issue. Is there a Java API for
>>>>>>>> monitoring cluster node health? Can you point me to an example or
>>>>>>>> give me a link?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello Joshua,
>>>>>>>>>
>>>>>>>>> I have a feeling this has something to do with the threadpool.
>>>>>>>>> There is a limit on the number of feeds that can be queued for
>>>>>>>>> indexing.
>>>>>>>>>
>>>>>>>>> Try increasing the size of the threadpool queues for index and bulk
>>>>>>>>> to a large number.
>>>>>>>>> Also, through the cluster nodes API for the threadpool, you can see
>>>>>>>>> whether any requests have failed.
>>>>>>>>> Monitor this API for requests that fail due to large volume.
>>>>>>>>>
>>>>>>>>> Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
>>>>>>>>> Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
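
If I understand this right, the change would go in config/elasticsearch.yml,
something like the lines below (the queue sizes are placeholder values I
picked, not numbers from this thread):

    # let more index/bulk requests queue before ES starts rejecting them
    threadpool.index.queue_size: 1000
    threadpool.bulk.queue_size: 1000

and the rejections should show up in the nodes stats:

    curl 'http://192.168.133.131:9200/_nodes/stats/thread_pool?pretty'

where a non-zero "rejected" count under "index" or "bulk" would mean requests
are being dropped.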
>>>>>>>>>
>>>>>>>>> Having said that, I wouldn't recommend bulk indexing that much
>>>>>>>>> information at a time, and 512 MB is not going to help much.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Vineeth
>>>>>>>>>
>>>>>>>>> On Tue, Sep 9, 2014 at 7:48 PM, Joshua P <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi there!
>>>>>>>>>>
>>>>>>>>>> I'm trying to do a one-time index of about 800,000 records into an
>>>>>>>>>> instance of Elasticsearch, but I'm having a bit of trouble. It
>>>>>>>>>> continually fails around 200,000 records. Looking at it in the
>>>>>>>>>> Elasticsearch Head plugin, my index goes offline and becomes
>>>>>>>>>> unrecoverable.
>>>>>>>>>>
>>>>>>>>>> For now, I have it running on a VM on my personal machine.
>>>>>>>>>>
>>>>>>>>>> VM Config:
>>>>>>>>>> Ubuntu Server 14.04 64-Bit
>>>>>>>>>> 8 GB RAM
>>>>>>>>>> 2 Processors
>>>>>>>>>> 32 GB SSD
>>>>>>>>>>
>>>>>>>>>> Java
>>>>>>>>>> java version "1.7.0_65"
>>>>>>>>>> OpenJDK Runtime Environment (IcedTea 2.5.1)
>>>>>>>>>> (7u65-2.5.1-4ubuntu1~0.14.04.2)
>>>>>>>>>> OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
>>>>>>>>>>
>>>>>>>>>> Elasticsearch is using mostly the defaults. This is the output
>>>>>>>>>> of:
>>>>>>>>>> curl http://localhost:9200/_nodes/process?pretty
>>>>>>>>>> {
>>>>>>>>>>   "cluster_name" : "property_transaction_data",
>>>>>>>>>>   "nodes" : {
>>>>>>>>>>     "KlFkO_qgSOKmV_jjj5xeVw" : {
>>>>>>>>>>       "name" : "Marvin Flumm",
>>>>>>>>>>       "transport_address" : "inet[/192.168.133.131:9300]",
>>>>>>>>>>       "host" : "ubuntu-es",
>>>>>>>>>>       "ip" : "127.0.1.1",
>>>>>>>>>>       "version" : "1.3.2",
>>>>>>>>>>       "build" : "dee175d",
>>>>>>>>>>       "http_address" : "inet[/192.168.133.131:9200]",
>>>>>>>>>>       "process" : {
>>>>>>>>>>         "refresh_interval_in_millis" : 1000,
>>>>>>>>>>         "id" : 1092,
>>>>>>>>>>         "max_file_descriptors" : 65535,
>>>>>>>>>>         "mlockall" : true
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I adjusted ES_HEAP_SIZE to 512mb.
>>>>>>>>>>
>>>>>>>>>> I'm using the following code to pull data from SQL Server and
>>>>>>>>>> index it.
>>>>>>>>>>