Hi Kevin,

beside the fact that SMSCs that sends DLR before ACK with message_id are just 
plain buggy and have to fixed I would try for the
correctly implemented SMSCs to return temp error if DLR not found and they will 
retry later.

For the patch itself here are some comments:

+        if ( smpp->dlr_retry_count > 0 ) {
+            // add the DLR into the retry queue
+            struct dlr_retry *retry = gw_malloc(sizeof(struct dlr_retry));

// is not C-Style comment

+static void do_dlr_retry_cleanup (SMPP *smpp)
+{
+    long i;
+    Msg *dlrmsg = NULL;
+    for (i = 0; i < gwlist_len(smpp->dlr_to_retry); i++) {
+        struct dlr_retry *dlr;
+        dlr = gwlist_get(smpp->dlr_to_retry, i);
+        if (dlr != NULL ) {
+            if ( difftime(time(NULL),dlr->retry_time) >= 0 ) {
+                debug("bb.sms.smpp", 0,"SMPP[%s]: Attempt <%ld> of <%ld> to 
process DLR id<%s> dst<%s>, type<%d>",
+                        octstr_get_cstr(smpp->conn->id),
+                        (dlr->dlr_retry_count+1), smpp->dlr_retry_count,
+                        octstr_get_cstr(dlr->msgid),
+                        octstr_get_cstr(dlr->destination_address), 
dlr->dlrstat);
+                dlrmsg = dlr_find(smpp->conn->id,
+                    dlr->msgid, /* smsc message id */
+                    dlr->destination_address, /* destination */
+                    dlr->dlrstat, 1);
+                if (dlrmsg != NULL) {
+                    debug("bb.sms.smpp", 0,"SMPP[%s]: Found matching DLR 
id<%s> dst<%s>, type<%d>",
+                            octstr_get_cstr(smpp->conn->id), 
octstr_get_cstr(dlr->msgid),
+                            octstr_get_cstr(dlr->destination_address), 
dlr->dlrstat);
+                    dlrmsg->sms.msgdata = octstr_duplicate(dlr->msgdata);
+                    dlrmsg->sms.sms_type = report_mo;
+                    dlrmsg->sms.account = octstr_duplicate(smpp->username);
+                    bb_smscconn_receive(smpp->conn, dlrmsg);
+                    gwlist_delete(smpp->dlr_to_retry,i,1);

you don't adjust "i" therefore you will miss the next DLR in the queue

+                } else {
+                    dlr->dlr_retry_count++;
+                    if (dlr->dlr_retry_count >= smpp->dlr_retry_count) {
+                        gwlist_delete(smpp->dlr_to_retry,i,1);

ditto

+                        error(0,"SMPP[%s]: Number of retry excedeed for DLR "
+                                "id<%s> dst<%s>, type<%d> discarding it",
+                                octstr_get_cstr(smpp->conn->id), 
octstr_get_cstr(dlr->msgid),
+                                octstr_get_cstr(dlr->destination_address), 
dlr->dlrstat);
+                    } else {
+                        dlr->retry_time = time(NULL) + 
smpp->dlr_retry_interval;
+                        debug("bb.sms.smpp", 0,"SMPP[%s]: DLR not yet found 
re-queueing it id<%s> dst<%s>, type<%d>",
+                                octstr_get_cstr(smpp->conn->id), 
octstr_get_cstr(dlr->msgid),
+                                octstr_get_cstr(dlr->destination_address), 
dlr->dlrstat);
+                    }
+                }
+           }
+        }
+    }
+}

+ * execute in a loop do_dlr_retry_cleanup and wait 1sec till the smpp conn is
+ * not quitting
+ */
+static void dlr_retry_thread(void *arg)
+{
+    SMPP *smpp;
+    smpp = arg;
+    /* Make sure we log into our own log-file if defined */
+    log_thread_to(smpp->conn->log_idx);
+    while (!smpp->quitting) {
+        do_dlr_retry_cleanup(smpp);
+        gwthread_sleep(1.0);
+    }
+}

Please make retry frequency configurable I think 1 sec is too low for some 
configurations when DLR DB should be busy.
I see you have already dlr_retry_interval why not return value how long to 
sleep?

If we will shutdown the bearerbox then we will missing following error because 
you just drop waiting DLRs without notice:

+            error(0,"SMPP[%s]: got DLR but could not find message or was not 
interested "
+                    "in it id<%s> dst<%s>, type<%d>",
+                    octstr_get_cstr(smpp->conn->id), octstr_get_cstr(tmp),
+                    octstr_get_cstr(destination_addr), dlrstat);

Thanks,
Alex

Am 23.07.2012 um 12:55 schrieb "Stevenard, Kevin (Kevin)" 
<[email protected]>:

> Hello Andreas,
> 
> The problem of the micro lag in the replication (I say micro because our 
> setup is not overloaded and most of the time lag it is not exceeding 100ms) 
> is not the main motivation of this patch. Even with a single DB server we 
> still have to retry DLRs. 
> 
> Our main problem is that we are connected to various SMSC / SMPP gateway and 
> some of them have weird behavior for example few of them sends sometimes the 
> deliver_sm before the submit_sm_resp ... for some of those gateways we have 
> seen that it is always on some specific cases like the end-user is not 
> available or not anymore reachable.
> 
> Some gateways also send deliver_sm in random / roundrobin / "load related" 
> manner. Or start to be crazy when we send too much load even if we stay in 
> the boundaries imposed by them.
> 
> Concerning the number of binds we are most of the time obliged to follow the 
> operator policy, so if we should have 2 binds per server on two different 
> SMSC with identical IDs then we do it... 
> 
> We also like to help operators and give feedbacks when their SMSC or gateway 
> when not acting properly but most of the time they can't or don't want to fix 
> those kind of issues.
> 
> For your information, even with high traffic or spike of load we don't use 
> too much that retry queue (max of 300 / day, and most of them are 
> successfully processed at the first retry) but each DLR positive or negative 
> is valuable for us and needs to be processed.
> 
> -----Original Message-----
> From: Andreas Fink [mailto:[email protected]] 
> Sent: lundi 23 juillet 2012 12:10
> To: Stevenard, Kevin (Kevin)
> Cc: [email protected]
> Subject: Re: dlr-retry-queue patch
> 
> Dear Kevin,
> 
> The DLR database entry is a temporary one. The Kannel instance which sends 
> the submit_sm is the entity which also will get the delivery reports. So you 
> should not have any issues in replication. The only reason why you would have 
> this is if you have multiple kannel's on multiple machines connecting to the 
> same smsc with the same user-id. The work around is simply to use different 
> user-id's so the SMSC will route the delivery report back to the kannel which 
> originally submitted the SMS. Then  every kannel can and should have his 
> private DLR storage. Replication might be useful for backup purposes or 
> failover but that's about it. Should one kannel fail, another kannel on 
> another machine can be fired up with the failed kannel's username and 
> database access to take over that load.
> 
> Maybe you explain better what you are trying to achieve...
> 
> 
> 
> On 23.07.2012, at 11:43, Stevenard, Kevin (Kevin) wrote:
> 
>> Hello Kannel users,
>> 
>> I have written a patch on the smsc_smpp connector to implement a DLR retry 
>> queue. In our services we are highly dependent of delivery reports. We were 
>> first using a patch 
>> (http://www.blogalex.com/wp-content/uploads/2009/05/kannel-dlr-retry.patch) 
>> but we were not happy with the following:
>> - blocking io thread due to the sleep implemented at dlr search level
>> - preventing instant deliver_sm_resp (due also to the sleep blocking io 
>> thread)
>> - system wide retry, can not be configured per account
>> 
>> As our setup is composed of several servers in a 'cluster' each server is 
>> running Kannel with several smpp/emi binds and each server is writing 
>> delivery reports in a local MySQL backend. We then use master-master 
>> replication to spread dlr entries on all servers.
>> As we have multiple servers and multiple binds its possible to get a dlr 
>> through a deliver_sm before we received the submit_sm_resp containing the 
>> message id or before the dlr entry is written and replicated in other db 
>> servers (even with a lag < 1sec).
> 
>> 
>> So the patch is adding a queue per smscconn and creates for each of them a 
>> dedicated thread to re-process dlr that have not been found in the dlr 
>> store, if not configured no queue and no thread are created. 
>> 
>> To configure it:
>> dlr-retry-count: Number of attempts to process a delivery report if not 
>> found in the delivery reports store. Defaults to 0 times (disabled).
>> dlr-retry-interval: This timer specifies the interval time between delivery 
>> reports retries. Defaults to 60 seconds.
>> 
>> #sample in a group = smsc && smsc = smpp:
>> dlr-retry-interval = 60
>> dlr-retry-count = 3
>> 
>> 
>> <dlr-retry-queue.patch>
> 
> 
> 


Reply via email to