Thanks! It turns out I was using less space on the VM than I thought; that, combined with a lack of decent error checking, meant I didn't catch the out-of-space problem. As soon as I added more space, I was able to index everything without a problem.
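For anyone who hits the same thing later: a minimal pre-flight check along these lines would have caught it early. This is only a sketch; the path is the package-default data directory from /etc/default/elasticsearch, so adjust it to wherever your path.data actually points, and pick your own threshold.

import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        // Package-default data directory; adjust if path.data points elsewhere.
        File dataDir = new File("/var/lib/elasticsearch");

        long freeMb = dataDir.getUsableSpace() / (1024 * 1024);
        System.out.println("Usable space on " + dataDir + ": " + freeMb + " MB");

        // Refuse to start a big bulk load on a nearly full volume.
        // The 5 GB threshold is an arbitrary example.
        if (freeMb < 5 * 1024) {
            throw new IllegalStateException(freeMb + " MB free is too little to index safely");
        }
    }
}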
Thanks again.

On Tuesday, September 9, 2014 6:49:35 PM UTC-4, Jörg Prante wrote:

Code looks okay, so it might just be the full volume that is in the way.

Jörg

On Tue, Sep 9, 2014 at 8:44 PM, Joshua P wrote:

This is the code I've been using to index. I'm going to try to fix the running-out-of-space issue and then try slimming down the settings. Thank you.

import java.io.IOException;
import java.util.List;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

import com.fasterxml.jackson.databind.ObjectMapper; // Jackson 2 assumed

public class Indexer {

    private static final Logger logger = LogManager.getLogger("ESBulkUploader");

    public static void main(String[] args) throws IOException, NoSuchFieldException {

        DBConnection dbConn = new DBConnection("");

        String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo"
                + " WHERE Country_id = 1 ORDER BY Property_id DESC";

        System.out.println("getting data");
        List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
        System.out.println("got data");

        ObjectMapper mapper = new ObjectMapper();

        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "property_transaction_data").build();

        Client client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("192.168.133.131", 9300));

        BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("About to index " + request.numberOfActions()
                        + " records of size " + request.estimatedSizeInBytes() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    for (BulkItemResponse item : response.getItems()) {
                        BulkItemResponse.Failure failure = item.getFailure();
                        if (failure != null) {
                            System.out.println(failure.getId() + " -- " + failure.getStatus().name()
                                    + " -- " + failure.getMessage() + " -- " + failure.getType());
                        }
                    }
                }

                System.out.println("Successfully indexed " + request.numberOfActions()
                        + " records in " + response.getTook() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.out.println("failure somewhere on " + request.toString());
                failure.printStackTrace();
                logger.warn("failure on " + request.toString());
            }
        }).setBulkActions(500).setConcurrentRequests(1).build();

        for (int i = 0; i < pgiTable.size(); i++) {
            // prep location field
            PropertyGeneralInfoRow pgiRow = pgiTable.get(i);

            Double[] location = { pgiRow.getLon_dbl(), pgiRow.getLat_dbl() };

            geocode geocode = new geocode();
            geocode.setLocation(location);
            pgiRow.setGeocode(geocode);

            // prep full address string
            pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", "
                    + pgiRow.getCity_tx() + ", " + pgiRow.getStateprov_cd() + ", "
                    + pgiRow.getCountry_tx() + ", " + pgiRow.getPostalcode_tx());

            String jsonRow = mapper.writeValueAsString(pgiRow);

            if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
                bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty")
                        .source(jsonRow.getBytes()));
                // bulkProcessor.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(jsonRow));
            } else {
                // don't add null strings..
                try {
                    System.out.println(pgiRow.toString());
                } catch (Exception e) {
                    System.out.println("Some error in the toString() method...");
                }
                System.out.println("Some json output was null. -- " + pgiRow.getProperty_id().toString());
            }
        }

        bulkProcessor.flush();
        bulkProcessor.close();
    }
}

On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:

Check the path.data setting in config/elasticsearch.yml

Jörg
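On the Debian/Ubuntu package the same thing can also be set as DATA_DIR in /etc/default/elasticsearch. In config/elasticsearch.yml it would look something like the following; the path here is only an example, so point it at whatever volume has room:

# config/elasticsearch.yml
# Hypothetical example: relocate the data directory to a larger volume.
path.data: /mnt/bigdisk/elasticsearch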
On Tue, Sep 9, 2014 at 7:50 PM, Joshua P wrote:

Just reran the indexer and found this error coming up. I'm running out of disk space on the partition ES wants to write to.

F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR -- TranslogException[[index_type][0] Failed to write operation [org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested: IOException[No space left on device]; -- index_type

Where would I change the write location? Which config file?

On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:

Hi Jörg,

Can you elaborate on what you mean by my still needing more fine tuning?

I've upped the heap size to 4g (in both places I mentioned before, because it's not clear to me which one ES actually uses). I haven't tried to index again yet. Other than throttling my indexing, what are some other things I need to be thinking about?

On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:

Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and indexing around 1 million docs, you need some more fine tuning, which is complicated. On your machine it is fine to set the heap to 4 GB, which is 50% of the 8 GB of RAM.

Jörg

On Tue, Sep 9, 2014 at 5:39 PM, Joshua P wrote:

Here is /etc/default/elasticsearch:

# Run Elasticsearch as this user ID and group ID
#ES_USER=elasticsearch
#ES_GROUP=elasticsearch

# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=512m

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Maximum number of open files, defaults to 65535.
MAX_OPEN_FILES=65535

# Maximum locked memory size. Set to "unlimited" if you use the
# bootstrap.mlockall option in elasticsearch.yml. You must also set
# ES_HEAP_SIZE.
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
#MAX_MAP_COUNT=262144

# Elasticsearch log directory
#LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
#DATA_DIR=/var/lib/elasticsearch

# Elasticsearch work directory
#WORK_DIR=/tmp/elasticsearch

# Elasticsearch configuration directory
#CONF_DIR=/etc/elasticsearch

# Elasticsearch configuration file (elasticsearch.yml)
#CONF_FILE=/etc/elasticsearch/elasticsearch.yml

# Additional Java OPTS
#ES_JAVA_OPTS=

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true

I also see the same setting in /etc/init.d/elasticsearch. Do you know which file takes priority? And what would a good size be?

On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I am not sure which variable you are referring to in the memory settings in the config file; please paste the comment and the config. I usually change the config from the init.d script.

The best approach would be to bulk index, say, 10,000 feeds in sync mode, wait until everything is indexed, and then proceed to the next batch. I am not sure about the Java API, but a while back I used to curl the stats API to see how many requests were rejected.

Thanks
Vineeth
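That stats API is also exposed through the Java admin client, so the rejected counts can be polled from the indexing job itself. A sketch against the 1.x Java API (method names as I recall them for 1.3; only the index and bulk pools are printed):

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.threadpool.ThreadPoolStats;

public class ThreadPoolMonitor {

    // Print queue depth and rejection counts for the index/bulk thread pools.
    // A growing "rejected" count means docs arrive faster than the node can queue them.
    public static void printThreadPoolStats(Client client) {
        NodesStatsResponse stats = client.admin().cluster().prepareNodesStats()
                .clear()
                .setThreadPool(true)
                .execute().actionGet();

        for (NodeStats node : stats.getNodes()) {
            for (ThreadPoolStats.Stats pool : node.getThreadPool()) {
                if ("index".equals(pool.getName()) || "bulk".equals(pool.getName())) {
                    System.out.println(node.getNode().getName() + " / " + pool.getName()
                            + ": queue=" + pool.getQueue()
                            + ", rejected=" + pool.getRejected());
                }
            }
        }
    }
}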
On Tue, Sep 9, 2014 at 8:58 PM, Joshua P wrote:

You also said you wouldn't recommend indexing that much information at once. How would you suggest breaking it up, and what status should I look for before starting another batch? I have to come up with a process that is repeatable and mostly automated.

On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:

Thanks for the reply, Vineeth!

What's a practical heap size? I've seen some people say they set it to 30gb, but this confuses me because the comment in the /etc/default/elasticsearch file suggests the max is only 1gb.

I'll look into the threadpool issue. Is there a Java API for monitoring cluster node health? Can you point me at an example or give me a link for that?

Thanks!

On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I have a feeling this has something to do with the threadpool. There is a limit on the number of feeds that can be queued for indexing.

Try increasing the size of the threadpool queues for index and bulk to a large number. Also, through the cluster nodes API for the threadpool, you can see whether any requests have failed. Monitor this API for requests that fail due to the large volume.

Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html

Having said that, I won't recommend bulk indexing that much information at a time, and 512 MB of heap is not going to help much.

Thanks
Vineeth

On Tue, Sep 9, 2014 at 7:48 PM, Joshua P wrote:

Hi there!

I'm trying to do a one-time index of about 800,000 records into an Elasticsearch instance, but I'm having a bit of trouble: it continually fails around 200,000 records. Looking at it in the Elasticsearch Head plugin, my index goes offline and becomes unrecoverable.

For now, I have it running on a VM on my personal machine.

VM config:
Ubuntu Server 14.04 64-bit
8 GB RAM
2 processors
32 GB SSD

Java:
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Elasticsearch is using mostly the defaults. This is the output of curl http://localhost:9200/_nodes/process?pretty:

{
  "cluster_name" : "property_transaction_data",
  "nodes" : {
    "KlFkO_qgSOKmV_jjj5xeVw" : {
      "name" : "Marvin Flumm",
      "transport_address" : "inet[/192.168.133.131:9300]",
      "host" : "ubuntu-es",
      "ip" : "127.0.1.1",
      "version" : "1.3.2",
      "build" : "dee175d",
      "http_address" : "inet[/192.168.133.131:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 1092,
        "max_file_descriptors" : 65535,
        "mlockall" : true
      }
    }
  }
}

I adjusted ES_HEAP_SIZE to 512mb.

I'm using the following code to pull data from SQL Server and index it.
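Following up on the batching advice above, a rough sketch of a sync-mode version of that indexing loop with the 1.x Java API. The index and type names are the ones from the indexer code; the batch size is a placeholder (Vineeth suggested around 10,000), and a real version would back off and retry rather than just bail out:

import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class SyncBatchIndexer {

    // Index JSON rows in synchronous batches: send one bulk request, block
    // until it is acknowledged, check for failures, then send the next batch.
    public static void indexInBatches(Client client, List<String> jsonRows, int batchSize) {
        for (int from = 0; from < jsonRows.size(); from += batchSize) {
            int to = Math.min(from + batchSize, jsonRows.size());

            BulkRequestBuilder bulk = client.prepareBulk();
            for (String jsonRow : jsonRows.subList(from, to)) {
                bulk.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(jsonRow));
            }

            // actionGet() blocks until the whole batch has been processed,
            // so the node is never flooded with concurrent bulk requests.
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                // Stop (or back off and retry) instead of piling on more load.
                throw new RuntimeException("Bulk failures at offset " + from + ": "
                        + response.buildFailureMessage());
            }
            System.out.println("Indexed " + (to - from) + " docs in " + response.getTook());
        }
    }
}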
